
Nexus: A GPU Cluster for Accelerating Neural Networks for Video Analysis

Haichen Shen1, Yuchen Jin1, Bingyu Kong2, Matthai Philipose3, Arvind Krishnamurthy1, Ravi Sundaram4

1University of Washington, 2Shanghai Jiao Tong University, 3Microsoft, 4Northeastern University

Abstract

We address the problem of serving Deep Neural Networks (DNNs) efficiently from a cluster of GPUs. In order to realize the promise of very low-cost processing made by accelerators such as GPUs, it is essential to run them at sustained high utilization. Doing so requires cluster-scale resource management that performs detailed scheduling of GPUs, reasoning about groups of DNN invocations that need to be co-scheduled, and moving from the conventional whole-DNN execution model to executing fragments of DNNs. Nexus is a fully implemented system that includes these innovations. On large-scale case studies on 16 GPUs, Nexus shows 1.8-41× better throughput than state-of-the-art systems while staying within latency constraints >99% of the time.

1 Introduction

Consider a cloud-scale video analysis service that allows thousands of tenants to analyze thousands of streams each concurrently. Increasingly, the core computations for this workload are Deep Neural Networks (DNNs), which are networks of dense linear algebra computations. Specialized hardware accelerators for DNNs, in the form of Graphics Processing Units (GPUs, which this paper focuses on) and even more specialized Tensor Processing Units (TPUs), have emerged in the recent past. Accelerators are spec'ed to process DNNs orders of magnitude faster and cheaper than CPUs in many cases. However, GPUs are expensive and very high capacity: modern devices each provide over 100 TFLOPS. Cost savings from using them depend critically on operating them at sustained high utilization. A fundamental problem therefore is to distribute the large incoming workload onto a cluster of accelerators at high accelerator utilization and acceptable latency. We address this problem in this paper.

Conceptually, this can be thought of as sharding inputs coming through a distributed frontend onto DNNs on "backend" GPUs. Several interacting factors complicate this viewpoint. First, given the size of GPUs, it is often necessary to place different types of networks on the same GPU. It is then important to select and schedule them so as to maximize their combined throughput while satisfying latency bounds. Second, many applications consist of groups of DNNs that feed into each other. It is important to be able to specify these groups, and to schedule the execution of the entire group on the cluster so as to maximize performance. Third, it is well known that dense linear algebra computations such as DNNs execute much more efficiently when their inputs are batched together. Batching fundamentally complicates scheduling and routing because (a) it benefits from cross-tenant and cross-request coordination and (b) it forces the underlying bin-packing-based scheduling algorithms to incorporate batch size. Fourth, the increasingly common use of transfer learning in today's workloads has led to specialization of networks, where two tasks that formerly used identical networks now use networks that are only mostly identical. Since batching only works when multiple inputs are applied to the same model in conventional DNN execution systems, the benefits of batching are lost.

Nexus is a GPU cluster for DNN execution that addresses these problems to attain high execution throughput under latency Service Level Objectives (SLOs). It uses three main techniques to do so. First, it relies on a novel batching-aware scheduler (Section 6.1) that performs bin packing when the balls being packed into bins have variable size, depending on the size of the batch they are in. This schedule specifies the GPUs needed, the distribution of DNNs across them, and the order of their execution so as to maximize execution throughput while staying within latency bounds. Second, it allows groups of related DNN invocations to be written as queries and provides automated query optimization to assign optimal batch sizes to the components of the query so as to maximize overall execution throughput of the query while staying within its latency bounds (Section 6.2). Finally, Nexus breaks from orthodoxy and allows batching of parts of networks with different batch sizes. This enables the batched execution of specialized networks (Section 5).

Nexus is completely implemented as a containerized system deployable on a commercial cloud, and comprises roughly 10k lines of C++. We have deployed Nexus on a 64-GPU cluster. On focused 16-GPU experiments compared with existing DNN serving systems (Tensorflow Serving [6] and Clipper [8]), we show improvements in throughput of 1.8-4.4× on a traffic monitoring case study, and 21-41× on a game-stream analysis case study, while satisfying latency SLOs (i.e., achieving a "good rate") >99% of the time. In a much larger experiment on a 64-GPU cluster with 10 applications and 7 different models, Nexus achieves a good rate of >99% while maintaining similarly high throughputs.


Figure 1: Example video streams. Windows to be analyzed by Nexus using DNNs are marked with boxes. Panels: (a) Traffic monitoring, (b) Game stream discovery, (c) News indexing, (d) Building access, (e) Home monitoring.

Figure 2: A typical vision processing pipeline (video stream → frame sampling → light-weight analysis → DNN-based analysis → aggregation). Nexus is designed to provide DNN-based analysis for tens of thousands of streams.

Model       CPU lat.   GPU lat.   CPU cost        TPU cost        GPU cost
                                  (0.1 TF peak)   (180 TF peak)   (125 TF peak)
LeNet5      6 ms       <0.1 ms    $0.01           $0.00           $0.00
VGG7        44         <1         0.13            0.01            0.00
ResNet50    1130       6.2        4.22            0.48            0.12
Inception4  2110       7.0        8.09            0.93            0.23
Darknet53   7210       26.3       24.74           2.85            0.70

Table 1: DNN execution latencies and estimated costs per 1000 invocations.² Acceleration may be necessary to meet latency deadlines, but can also be cheaper, given low cost/TFLOPS (written TF above).

2 Background

Nexus provides efficient and timely cloud-based acceleration of applications that analyze video. Figure 1 provides some examples of the applications we target. For instance, a city may have hundreds of traffic cameras, a game streaming site [1] may index tens of thousands of concurrent streams for discovery, a media company may index dozens of live broadcasts, and a home monitoring company may analyze camera feeds from millions of customers. Nexus is designed to support many such users and applications simultaneously, pooling workloads across them to achieve efficiency. Below, we examine the structure of these applications and outline the challenges and opportunities in serving them at scale.

2.1 Vision pipelines and DNN-based analysis

At the highest level, a vision-based application aggregates visual information from one or more video streams using custom "business" logic. Each stream is processed using a pipeline similar to that in Figure 2. CPU-based code, either on the edge or in the cloud, selects frames from the stream

² Per-device prices for 1000 invocations assuming peak execution rates on on-demand instances of AWS c5.large (Intel AVX 512), p2.xlarge (NVIDIA K80), p3.2xlarge (NVIDIA V100), and GCP Cloud TPU on 9/14/2018.

for processing, applies business logic to identify what parts (or windows) of the image need deeper analysis, applies a set of Deep Neural Networks (DNNs) (a query) to these windows, and aggregates the results in an application-specific way, often writing to a database. A query may represent a single DNN applied to the window, but often it may represent a sequence of dependent DNN applications, e.g., running an object detector on the window and running a car make/model detector on all sub-windows determined to be cars.

Typically, a stream is sampled a few times a second or minute, and the DNN query should complete execution in tens to hundreds of milliseconds (for "live" applications) or within several hours (for "batch" applications). The execution of DNNs dominates the computation pipeline, and the cost of executing them dominates the cost of the vision service. Nexus provides a standalone service that implements the DNN-based analysis stage for vision pipelines.

2.2 Accelerators and the challenge of utilizing them

As Table 1 shows, a key to minimizing the cost of executing DNNs is the use of specialized accelerators such as GPUs and TPUs (Tensor Processing Units, essentially specialized ASICs), which are highly optimized to execute the dense linear algebra computations that comprise DNN models. The table shows the execution latency and the dollar cost of 1000 invocations for a few common models on CPUs and GPUs. Execution times on CPUs can be orders of magnitude slower than those on GPUs. For many applications, therefore, latency constraints alone may dictate GPU-accelerated execution.

Perhaps more fundamentally, GPUs and TPUs promise much lower cost per operation than even highly accelerated CPUs: Table 1 lower-bounds the cost of executing a model by assuming that models can be executed at peak speed on each platform. Even compared to state-of-the-art CPUs, accelerators can yield a cost advantage of up to 9× (for TPUs) and 34× (for GPUs). On the other hand, accelerators have extremely high computational capacity (e.g., 125 TFLOPS for the NVIDIA V100). To realize their cost savings, it is critical to sustain high utilization of this capacity. Sustaining high utilization is hard, however. For instance, the LeNet model of Table 1 consumes 20 MOPs to run, implying that a single V100 would require 125 TFLOPS ÷ 20 MOPs = 6.25M inputs/second to run at full utilization!
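As a back-of-the-envelope check of this arithmetic, the short sketch below computes the input rate needed to keep an accelerator at peak utilization from its peak throughput and a model's per-inference cost; the numbers are the LeNet/V100 figures quoted above, and the helper name is ours.

def saturation_rate(peak_ops_per_sec, ops_per_inference):
    """Inputs/second needed to keep an accelerator at full utilization."""
    return peak_ops_per_sec / ops_per_inference

# LeNet (~20 MOPs per inference) on an NVIDIA V100 (~125 TFLOPS peak).
print(f"{saturation_rate(125e12, 20e6):,.0f} inputs/sec")  # ~6,250,000 inputs/sec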

No single stream, or even most applications, can yield such rates. By aggregating inputs across streams, Nexus is designed to funnel adequate work to each accelerator.


Figure 3: Impact of batching DNN inputs on model execution throughput (reqs/s) and latency (ms) as a function of batch size, for VGG-16, Inception, ResNet-50, and LeNet on an NVIDIA GTX 1080.

However, as we discuss next, having "enough" work is not sufficient to achieve high utilization: it is important to group the right type of work in the right place.

2.3 Placing, packing and batching DNNs

DNNs are networks of dense linear algebra operations (e.g., matrix multiplication and convolution), called layers or kernels. Networks are also called models. By default, the GPU simply executes the kernels presented to it in the order received. The kernels themselves are often computationally intensive, requiring MFLOPs to GFLOPs to execute, and range in size from one MB to hundreds of MBs. These facts have important implications for GPU utilization.

First, loading models into memory can cost hundreds of milliseconds to seconds. When serving DNNs at high volume, therefore, it is usually essential to place the DNN on a particular GPU by pre-loading it into GPU memory and then re-using it across many subsequent invocations. Placement brings with it the traditional problems of efficient packing. Which models should be co-located on each GPU, and how should they be scheduled to minimize mutual interference?

Second, it is well known that the processor utilization achieved by kernels depends critically upon batching, i.e., grouping input matrices into higher-dimensional ones before applying custom "batched" implementations of the kernels. Intuitively, batching allows kernels to avoid stalling on memory accesses by operating on each loaded input many more times than without batching. As Figure 3 shows, batching can improve the throughput of model execution significantly, e.g., improvements of 2.5-24× for batch sizes of 8-64 for the VGG and LeNet models relative to batch size 1, with intermediate gains for other models. Although batching increases execution latency, it usually stays within application-level bounds.

Although batching is critical for utilization, it complicates many parts of the system design. Perhaps most fundamentally, the algorithm for packing models on GPUs needs to change because the resource quantum used on each input is "squishy", i.e., it varies with the size of the batch within which that input executes. Further, the latency of execution also depends on the batch size. This new version of bin packing, which we dub squishy bin packing, needs to reason explicitly about batching (Section 6.1). Batching also complicates query processing. If a certain latency SLO (Service Level Objective) is allocated to the query as a whole, the system needs to partition the latency across the DNN invocations that comprise the query so that each latency split allows efficient batched execution of the related DNN invocation (Section 6.2). We call this complex query scheduling.

Finally, batching is conventionally only feasible when the same model is invoked with different inputs. For instance, we expect many applications to use the same well-known, generally applicable models (e.g., ResNet50 for object recognition). However, the generality of these models comes at the price of higher resource use. It has become common practice [22, 12] to use smaller models specialized (using "transfer learning") to the few objects, faces, etc. relevant to an application by altering ("re-training") just the output layers of the models. Since such customization destroys the uniformity required by conventional batching, making specialized models play well with batching is often critical to efficiency.

3 Related Work

The projects most closely related to Nexus are Clipper [8] and Tensorflow Serving [6]. Clipper is a "prediction serving system" that serves a variety of machine learning models, including DNNs, on CPUs and GPUs. Given a request to serve a machine learning task, Clipper selects the type of model to serve it, batches requests, and forwards the batched requests to a backend container. By batching requests, and adapting batch sizes online under a latency SLO, Clipper takes a significant step toward Nexus's goal of maximizing serving throughput under latency constraints. Clipper also provides approximation and caching services, complementary to Nexus's focus on executing all requests exactly but efficiently. Tensorflow Serving can be viewed as a variant of Clipper that does not provide approximation and caching, but also has additional machinery for versioning models.

To the basic batched-execution architecture of Clipper, Nexus builds along the dimensions of scale, expressivity, and granularity. These techniques address the challenges brought up earlier in this section, and thus reflect Nexus's focus on executing DNNs on GPUs at high efficiency and scale.

Scale: Nexus provides the machinery to scale serving to large, changing workloads. In particular, it automates the allocation of resources (including GPUs) and the placement and scheduling of models across allocated resources. It provides a distributed frontend that scales to large numbers of requests, with a work-sharding technique that plays well with batching requests on the back end. These functions are performed on a continuing basis to adapt to workloads, with re-allocation and re-packing structured to cause minimum churn.

Expressivity: Nexus provides a query mechanism that (a) allows related DNN execution tasks to be specified jointly, and (b) allows the user to specify the latency SLO just at the whole-query level. Nexus then analyzes the query and allocates latency bounds and batch sizes to constituent DNN tasks so as to maximize the throughput of the whole query.


Granularity: Where Clipper limits the granularity of batched execution to whole models, Nexus automatically identifies common subgraphs of models and executes them in a batch. This is critical for batching on specialized models, which often share all but the output layer, as described previously.

Serving DNNs at scale is similar to other large-scale short-task serving problems. These systems have distributed frontends that dispatch low-latency tasks to queues on the backend servers. Sparrow [24] focuses on dispatch strategies to reduce the delays associated with queuing in such systems. Slicer [5] provides a fast, fault-tolerant service for dividing the back end into shards and load balancing across them. Both systems assume that backend server allocation and task placement are performed at a higher (application) level, using cluster resource managers such as Mesos [16] or Omega [28]. Nexus shares the philosophy of these systems of having a fast data plane that dispatches incoming messages from the frontend to backend GPUs and a slower control plane that performs more heavyweight scheduling tasks, such as resource allocation, packing, and load balancing. Also, the Nexus global scheduler communicates with cluster resource managers to obtain or release resources.

Much work in the machine learning and systems community has focused on producing faster models. For instance, many model optimization techniques [26, 4, 33] produce faster versions of models, often at small losses in accuracy. Given a variety of models of varying accuracy, they can be combined to maintain high accuracy and performance, using static or dynamic techniques [29, 20, 17, 14, 8]. Nexus views the optimization, selection, and combination of models as best done by the application, and provides no special support for these functions. On the other hand, once the models to be executed are selected, Nexus will execute them in a fast, efficient manner.

4 Examples

In this section, we use two simple examples to explain the optimization problems associated with scheduling DNN workloads from Section 2.3. The first example explores squishy bin packing, and the second, scheduling complex queries.

4.1 Squishy bin packing

Consider a workload that consists of three different types of tasks that invoke different DNN models. Let the desired latency SLOs for tasks invoking models A, B, and C be 200ms, 250ms, and 250ms, respectively. Table 2 provides the batch execution latency and throughput at different batch sizes (i.e., the "batching profile") for each model.

We first explore the most basic scenario where all three types of tasks are associated with high request rates, so that multiple GPUs are required to handle each task type. To maximize GPU efficiency, we need to choose the largest possible batch size while still meeting the latency SLO.

Model   Batch Size   Batch Lat (ms)   Throughput (reqs/s)
A       4            50               80
        8            75               107
        16           100              160
B       4            50               80
        8            90               89
        16           125              128
C       4            60               66.7
        8            95               84
        16           125              128

Table 2: Batching profiles for models used in the example. Batch Lat is the latency for processing a batch.

We note that the batch execution cost for a given task type cannot exceed half of the task's latency SLO; a task that missed being scheduled with a batch would be executed as part of the next batch, and its latency would be twice the batch execution cost. For example, since the latency SLO of job 1 is 200ms, the maximum batch size that job 1 can use is 16, according to Table 2. Therefore, the maximum throughput that job 1 can achieve on a single GPU is 160 reqs/sec, and the number of GPUs to be allocated for job 1 should be r1/160, where r1 is the request rate of job 1. Similarly, the numbers of GPUs for jobs 2 and 3 should be r2/128 and r3/128, where r2 and r3 are the request rates for jobs 2 and 3, respectively. Figure 4(a) depicts the desired schedules for the three types of jobs.
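A minimal sketch of this batch-size selection, using the Table 2 profile for model A, the rule that worst-case latency is twice the batch execution latency, and an assumed request rate of 640 reqs/sec for illustration:

# Batching profile for model A (Table 2): batch size -> (batch latency in ms, throughput in reqs/s)
profile_A = {4: (50, 80), 8: (75, 107), 16: (100, 160)}

def max_batch_under_slo(profile, slo_ms):
    """Largest batch size whose worst-case latency (2 x batch latency) fits the SLO."""
    feasible = [b for b, (lat, _) in profile.items() if 2 * lat <= slo_ms]
    return max(feasible) if feasible else None

def gpus_needed(profile, slo_ms, request_rate):
    """Fractional GPU count: request rate divided by one GPU's throughput at the chosen batch."""
    b = max_batch_under_slo(profile, slo_ms)
    return request_rate / profile[b][1]

print(max_batch_under_slo(profile_A, 200))   # 16
print(gpus_needed(profile_A, 200, 640))      # 4.0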

We next consider a situation where the request rates for the jobs are not high and each job requires only a fraction of a GPU's computing power. In this case, the scheduler needs to consolidate multiple types of DNN tasks onto the same GPU to optimize resource utilization. Consider a workload where job 1 receives 64 reqs/sec, and jobs 2 and 3 each receive 32 reqs/sec. We consider schedules wherein one or more model types are assigned to each GPU. A GPU then executes batches of different types of tasks in a round-robin manner, and it cycles through the different model types over a time period that we refer to as the duty cycle. The worst-case latency for a task is no longer twice the batch execution cost but rather the sum of the duty cycle and the batch execution cost for that task type.

Given this setup, we observe that we can schedule model A tasks in batches of 8 as part of a duty cycle of 125ms; the resulting throughput is the desired rate of 64 reqs/sec, the batch execution cost for 8 tasks is 75ms, and the worst-case execution delay of 200ms matches the latency SLO (see Figure 4(b)). We then check whether the GPU has sufficient slack to accommodate tasks associated with models B or C. Within a duty cycle of 125ms, we would need to execute 4 tasks of either B or C to meet the desired rate of 32 reqs/sec. The batch execution cost of 4 model B tasks is 50ms, which can fit into the residual slack in the duty cycle. On the other hand, a batch of 4 model C tasks would incur 60ms and cannot be scheduled inside the duty cycle.


Figure 4: Resource allocation example. (a) Saturated workload; (b) Residual workload with a 125 ms duty cycle.

Further, the worst-case latency for model B is the sum of the duty cycle and its own batch execution cost, 175ms (= 125 + 50), which is lower than its latency SLO of 250ms. Thus, it is possible to co-locate models A and B on the same GPU, but not model C.
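The co-location check used in this example can be sketched as below: given the duty cycle fixed by the first session and the time it already occupies, test whether a second session's batch both fits in the residual slack and meets its own SLO (profiles and rates are the Table 2 example values; exact profile entries are assumed).

profile_B = {4: (50, 80), 8: (90, 89), 16: (125, 128)}    # batch -> (latency ms, reqs/s)
profile_C = {4: (60, 66.7), 8: (95, 84), 16: (125, 128)}

def can_colocate(duty_cycle_ms, used_ms, profile, rate_rps, slo_ms):
    """Can a residual session share a GPU whose duty cycle is already partly occupied?"""
    b = round(rate_rps * duty_cycle_ms / 1000)      # requests gathered in one duty cycle
    lat_ms = profile[b][0]                          # batch execution latency at that size
    fits_in_slack = used_ms + lat_ms <= duty_cycle_ms
    meets_slo = duty_cycle_ms + lat_ms <= slo_ms    # worst case: duty cycle + batch latency
    return fits_in_slack and meets_slo

# Model A occupies 75 ms of a 125 ms duty cycle (batch of 8 at 64 reqs/s).
print(can_colocate(125, 75, profile_B, 32, 250))    # True:  50 ms fits and 175 ms <= 250 ms
print(can_colocate(125, 75, profile_C, 32, 250))    # False: 60 ms overflows the duty cycle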

We now make a few observations regarding the scenario discussed above and why the associated optimization problem cannot be addressed directly by known scheduling algorithms. First, unlike vanilla bin packing, which would pack fixed-size balls into bins, here the tasks incur lower costs when multiple tasks of the same type are squished together into a GPU. Second, in addition to the capacity constraints associated with the compute and/or memory capabilities of a GPU, there are also latency constraints in generating a valid schedule. Third, there are many degrees of freedom in generating a valid schedule. The batch sizes associated with different model executions are not only a function of the request rate but also of the duty cycle in which the batch is embedded. In Section 6.1, we describe how to extend traditional algorithms to accommodate these extra considerations.

4.2 Complex query scheduling

In the previous example, we considered simple applications that use only one model, but, in practice, applications usually comprise dependent computations over multiple DNN models. For example, a common pattern is a detection and recognition pipeline that first detects certain objects in the image and then recognizes each object. The developer will specify a latency SLO for the entire query, but since the system would host and execute the constituent models on different nodes, it would have to automatically derive latency SLOs for the invoked models and derive schedules that meet these latency SLOs. We discussed the latter issue in the previous example, and we now focus on the former issue.

Consider a basic query that executes model X and feeds the output of X to model Y, as shown in Figure 5.

Figure 5: Query pipeline (model X feeds model Y with per-invocation fan-out α).

(a)
Model   Batch Latency (ms)   Throughput (reqs/s)
X       40                   200
        50                   250
        60                   300
Y       40                   300
        50                   400
        60                   500

(b)
Latency budget        Avg Throughput (reqs/s)
X      Y              α = 0.1   α = 1   α = 10
40     60             192.3     142.9   40.0
50     50             235.3     153.8   34.5
60     40             272.7     150.0   27.3

Table 3: Table (a) shows the batch execution latency and throughput of models X and Y used in the example query. Table (b) shows the average throughput with three latency split plans for varying α.

Suppose we have a 100ms latency budget for processing this query, and suppose that every invocation of X yields α outputs (on average). When α < 1, model X operates as a filter; when α = 1, model X basically maps an input to an output; when α > 1, model X yields multiple outputs from an input (e.g., detection of multiple objects within a frame).

Assume that Table 3(a) depicts the batch execution latency and throughput of models X and Y. The system has to decide what latency SLOs it has to enforce on each of the two types of models such that the overall latency is within 100ms and the GPU utilization of the query as a whole is maximized. For the purpose of this example, we consider a limited set of latency split plans for models X and Y: (a) 40ms and 60ms, (b) 50ms and 50ms, (c) 60ms and 40ms. It would appear that plan (a) should work best since the sum of the throughputs is largest among the three plans, but a closer examination reveals some interesting details.

For workloads involving a large number of requests, let us assume that p and q GPUs execute models X and Y, respectively. We then require α · p · TX = q · TY, where TX and TY are the throughputs of models X and Y, so that the pipeline is not bottlenecked by either model. We define the average throughput as the pipeline throughput divided by the total number of GPUs, which is p · TX / (p + q). We evaluate the average throughputs for the three latency split plans with α = 0.1, 1, 10. Table 3(b) shows that each of the three split plans achieves the best performance for a different α value. In fact, there is no universal best split: it depends on α, which can vary over time.
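Under the balance constraint α·p·TX = q·TY, the average throughput p·TX/(p+q) simplifies to TX·TY/(TY + α·TX). The sketch below reproduces Table 3(b) from the Table 3(a) profiles under that assumption:

T_X = {40: 200, 50: 250, 60: 300}   # latency budget (ms) -> per-GPU throughput (reqs/s)
T_Y = {40: 300, 50: 400, 60: 500}

def avg_throughput(budget_x, budget_y, alpha):
    """Pipeline throughput per GPU when GPU counts satisfy alpha * p * Tx == q * Ty."""
    tx, ty = T_X[budget_x], T_Y[budget_y]
    return tx * ty / (ty + alpha * tx)

for bx, by in [(40, 60), (50, 50), (60, 40)]:
    print(bx, by, [round(avg_throughput(bx, by, a), 1) for a in (0.1, 1, 10)])
# 40 60 [192.3, 142.9, 40.0]
# 50 50 [235.3, 153.8, 34.5]
# 60 40 [272.7, 150.0, 27.3]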


Figure 6: Nexus runtime system overview. A global scheduler performs epoch scheduling every epoch (updating latency splits from query latency SLOs, determining whether models can be prefix batched, performing squishy bin packing, and monitoring workload changes), and coordinates with the cluster manager, a model database (model ingest), frontends running application containers with the Nexus library, and backends whose GPU schedulers execute the models.

We learn two lessons from this example. First, the latency split for a complex query affects overall efficiency, and it is necessary to understand model batch performance and workload statistics to make the best decision. Second, the latency split should not be static but rather adapted over time in accordance with the latest workload distribution. Section 6.2 describes how the analysis tool in Nexus automatically and continually derives latency splits for complex queries.

5 System Overview

Figure 6 provides an overview of Nexus. Nexus works on three planes. The management plane allows developers to ingest and deploy applications and models, at a timescale of hours to weeks. The control plane, via the global scheduler, is responsible for resource allocation and scheduling at a typical timescale of seconds to minutes. The data plane, comprised of in-application Nexus library instances and backend modules (together, the Nexus runtime), dispatches and executes user requests at the timescale of milliseconds to seconds. The global scheduler interacts with the underlying cluster resource manager (e.g., Mesos [16], Azure Scale Sets [23]) to acquire CPUs/GPUs for the frontend/backend. A load balancer (not shown) from the underlying cluster spreads user requests across Nexus's distributed frontend. We sketch the three planes below.

Management plane: Developers may ingest models and applications to Nexus. Models are stored in a model database and may be accompanied by either a sample data set or a batching profile (see Table 2). Nexus uses the sample data set, if available, to derive a batching profile. Otherwise, the profile is updated at runtime based on user requests. Applications are containers whose entry points invoke Nexus's App API (Figure 7). Figure 8 shows a sample traffic monitoring app satisfying the interface.³ Developers store application containers in cluster-provided container repositories and may instruct Nexus to ingest a container, at which point it is loaded from the repository onto a frontend CPU.

³ Apps need not contain all the business logic of an application. Rather, they just specify the logic to invoke the DNNs to process an input window.

class ModelHandler:
    # Async RPC: executes the model on remote backends
    # and returns an output handler immediately
    def Execute(self, input_data, output_fields)

class App:
    # Sends a load-model request to the global scheduler
    # and returns a ModelHandler
    def GetModelHandler(self, framework, model_name, version)
    # Implemented by developer
    def Setup(self)
    # Implemented by developer
    def Process(self, request)

Figure 7: Nexus application API.

class TrafficApp(App):
    def Setup(self):
        self.yolo = self.GetModelHandler("darknet", "yolo", 2)
        self.car_rec = self.GetModelHandler("tensorflow", "googlenet_car", 1)
        self.face_rec = self.GetModelHandler("caffe", "vgg_face", 1)

    def Process(self, request):
        persons, cars = [], []
        for obj in self.yolo.Execute(request.image):
            if obj["class"] == "person":
                persons.append(self.face_rec.Execute(request.image[obj["rect"]]))
            elif obj["class"] == "car":
                cars.append(self.car_rec.Execute(request.image[obj["rect"]]))
        return Reply(request.user_id, persons, cars)

Figure 8: Traffic monitoring application.


Control plane: The global scheduler is a cluster-wide resource manager that uses load and usage statistics from the runtime. It uses this profile to add or remove frontend and backend nodes from the cluster, and it invokes the epoch scheduler to decide which models to execute and at what batch size, and on which backend to place the models, so as to balance load and maximize utilization. Multiple models may be mapped onto a single backend, albeit with an execution schedule that ensures they do not interfere, as in Section 4.1. The mapping from models to the backends where they are to be executed is captured in a routing table that is broadcast to frontends. The matching execution schedule for each backend is captured in a schedule that is communicated to backends. On receiving a routing table, frontends update their current routing table. On receiving a schedule, backends load the appropriate models into GPU memory and set their execution schedule.

Allocation, scheduling, and routing updates happen at the granularity of an epoch, typically 30-60s, although a new epoch can also be triggered by large changes in workload. Epoch scheduling involves the following:

• Produce an updated split of the latency SLO based on α values estimated from the latest profile (see Section 6.2).

• Combine two or more models that share a prefix and latency SLO into a new prefix-batched model.

• Perform profile-guided squishy bin packing to allocate the GPU resources for each model (Section 6.1 has details).


Notation   Description
Mk         DNN model k
Si         Session i
Li         Latency constraint for session Si
Ri         Request rate for session Si
ℓk(b)      Execution cost for model Mk at batch size b

Table 4: Notation

Data plane: When a user request comes into (a replica of) an application container, the application invokes DNNs via the Execute method from the Nexus library. The library consults the local routing table to find a suitable backend for that model, dispatches the request to the backend, and delivers responses back to the application. The application is responsible for packaging and delivering the end result of the query to the user. A backend module uses multiple threads to queue requests from various frontends, selects and executes models on these inputs in batched mode according to the current schedule, and sends the results back to frontends. It can utilize one or more GPUs on a given node, with each GPU managed by a GPU scheduler that schedules tasks on it.

6 Global Scheduler

We now describe the algorithms used by the global scheduler as part of its epoch scheduling. The goal of the global scheduler is to minimize the number of GPUs allocated to processing the workload without violating latency SLOs. We describe the algorithms in stages. First, we consider the case of scheduling streams of individual DNN task requests, given their expected arrival rates and latency SLOs (akin to the example discussed in Section 4.1). We then consider how to schedule streams of more complex queries/jobs that invoke multiple DNN tasks (similar to the example discussed in Section 4.2).

6.1 Scheduling streams of individual DNN tasks

We now consider scheduling streams of individual DNN tasks and build upon the discussion in Section 4.1. The scheduler identifies, for each cluster node, the models that the node will serve and the target serving throughputs for those models, such that the schedule is computationally feasible and does not violate the latency SLOs. As discussed earlier, the scheduling problem has the structure of the bin-packing problem [2], but the solution is harder, as different batching thresholds incur different per-task costs and different co-locations of model types on a node result in different worst-case model execution latencies. The "squishiness" of tasks and the need to meet latency SLOs are the considerations that we address in the algorithm presented below.

Inputs and Notation: The scheduling algorithm is provided the following inputs (a minimal sketch of this data representation follows the list).

• It is provided the request rate of invocations for a given model at a given latency SLO. We refer to the requests for a given model and latency SLO as a session. Note that a session corresponds to classification requests from different users and possibly different applications that invoke the model at a given latency constraint. Table 4 describes the notation used below. Formally, a session Si specifies a model Mki and a latency constraint Li, and there is a request rate Ri associated with it.

• The algorithm is also provided with the execution costs of different models at varying batch sizes. The latency of executing a batch of b invocations of model Mk is referred to as ℓk(b). We assume that throughput is non-decreasing with regard to batch size b.
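A minimal sketch of how these inputs might be represented, assuming (as in Table 2) that each batching profile is a table of measured (latency, throughput) points:

from dataclasses import dataclass
from typing import Dict, Tuple

# Batching profile: batch size -> (batch execution latency in ms, throughput in reqs/s)
BatchProfile = Dict[int, Tuple[float, float]]

@dataclass
class Session:
    model: str             # model M_k invoked by this session
    latency_slo_ms: float  # L_i
    request_rate: float    # R_i, in reqs/s

def exec_latency(profile: BatchProfile, b: int) -> float:
    """l_k(b): execution latency of a batch of size b (an exact profile entry is assumed)."""
    return profile[b][0]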

Scheduling Overview: The scheduler allocates one or more sessions to each GPU and specifies their corresponding target batch sizes. Each GPU node n is then expected to cycle through the sessions allocated to it, execute invocations of each model in batched mode, and complete one entire cycle of batched executions within a duty cycle of dn. For sessions that have a sufficient number of user requests, one or more GPU nodes are allocated to a single session to execute model invocations. The algorithm described below computes the residual workload for such sessions after allocating an integral number of GPUs and then attempts to perform bin packing with the remaining sessions. For the bin-packing process, the scheduler inspects each session in isolation and computes the largest batch size and the corresponding duty cycle for the executing GPU in order to meet the throughput and SLO needs. The intuition behind choosing the largest batch size is to have an initial schedule wherein the GPU operates at the highest efficiency. This initial schedule, however, isn't cost effective, as it assumes that each GPU is running just one session within its duty cycle, so the algorithm then attempts to merge multiple sessions within a GPU's duty cycle. In doing so, it should not violate the latency SLOs, so we require the merging process to only reduce the duty cycle of the combined allocation. The algorithm considers sessions in decreasing order of associated work and merges them into existing duty cycles that have the highest allocations, thus following the design principle behind the best-fit-decreasing technique for traditional bin packing.

Details: We now describe the global scheduling algorithm for a given workload (which is also depicted in Algorithm 1).

Consider a session Si that uses model Mki and has latency constraint Li. We first consider executing the model Mki solely on one GPU (Lines 1-7 in Algorithm 1). Suppose the batch size is b; the execution latency of model Mki is ℓki(b), and the worst-case latency for any given request is 2ℓki(b), as we explained in Section 4.1. Denote by Bi the maximum batch size b that satisfies the constraint 2ℓki(b) ≤ Li. Therefore, the maximal throughput, denoted by Ti, of model Mki with latency constraint Li on a single GPU is Bi/ℓki(Bi). The number of GPU nodes we allocate to execute just Si requests is n = ⌊Ri/Ti⌋.


Algorithm 1 Global Scheduling Algorithm
GLOBALSCHEDULE(Sessions)
 1: residue_loads ← {}
 2: for Si = ⟨Mki, Li, Ri⟩ in Sessions do
 3:     Bi ← argmax_b (2·ℓki(b) ≤ Li)
 4:     Ti ← Bi / ℓki(Bi)
 5:     let Ri = n·Ti + ri
 6:     assign n GPU nodes to execute Mki with batch Bi
 7:     residue_loads ← residue_loads ⊕ ⟨Mki, Li, ri⟩
 8: for ⟨Mki, Li, ri⟩ in residue_loads do
 9:     bi ← argmax_b (ℓki(b) + b/ri ≤ Li)
10:     di ← bi / ri
11:     occi ← ℓki(bi) / di
12: sort residue_loads by occi in descending order
13: nodes ← {}
14: for ⟨Mki, Li, ri, bi, di, occi⟩ in residue_loads do
15:     max_occ ← 0
16:     max_node ← NULL
17:     for n = ⟨b, d, occ⟩ in nodes do
18:         n′ ← MERGENODES(n, ⟨bi, di, occi⟩)
19:         if n′ ≠ NULL and n′.occ > max_occ then
20:             max_occ ← n′.occ
21:             max_node ← n′
22:     if max_node ≠ NULL then
23:         replace max_node for its original node in nodes
24:     else
25:         nodes ← nodes ⊕ ⟨bi, di, occi⟩

Note that n = 0 for sessions that do not have sufficient requests to utilize an entire GPU.

As a consequence, there will be a residual unallocated load for session Si after taking into account this allocated load. The next phase assigns this residual load to one other node in the system (Lines 8-11 in Algorithm 1). Denote ri = Ri − n·Ti as the request rate of the residual load. Suppose we execute the residual load with batch size b; the duty cycle d for gathering b inputs is b/ri. Then, the worst-case latency is d + ℓki(b). Therefore, we have the constraint:

d + ℓki(b) = b/ri + ℓki(b) ≤ Li    (1)

We begin residual load scheduling by choosing, for session Si with residual load ri, the maximum batch size bi that satisfies the above constraint. Correspondingly, the duty cycle d is also at its maximal value. Note that this batch size maximizes GPU efficiency for the given latency constraint due to the following argument. Denote the occupancy (occ) as the fraction of the duty cycle d occupied by Si's residual load invocations: occi(b) = ℓki(b)/d.
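A sketch of this step (Lines 8-11 of Algorithm 1), again assuming the batching profile is a table of measured points:

def residual_schedule(profile, rate_rps, slo_ms):
    """Largest batch b with l(b) + b/r <= L; returns (batch, duty cycle in ms, occupancy)."""
    best = None
    for b, (lat_ms, _) in sorted(profile.items()):
        duty_ms = 1000.0 * b / rate_rps              # time to gather b requests at rate r
        if lat_ms + duty_ms <= slo_ms:
            best = (b, duty_ms, lat_ms / duty_ms)    # occupancy = l(b) / d
    return best

profile_A = {4: (50, 80), 8: (75, 107), 16: (100, 160)}
print(residual_schedule(profile_A, 64, 200))         # (8, 125.0, 0.6)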

Next, we start to merge these fractional nodes into fewer nodes (Lines 12-25 in Algorithm 1). This part resembles the classic bin-packing algorithm that first sorts sessions by decreasing occupancy and then merges two nodes onto a single node by best fit. The primary difference is how to determine whether two nodes can be merged such that all sessions won't violate their latency SLOs.

Figure 9: Merging two nodes into one. Use the smaller duty cycle as the new duty cycle for both nodes, update the batch sizes accordingly, and re-estimate the batch execution latencies. If the sum of the execution latencies does not exceed the new duty cycle, the two nodes can be merged into one.

Figure 9 depicts the process of merging two nodes. Suppose we have two sessions S1 and S2, with request rates r1 and r2, assigned batch sizes b1 and b2, and duty cycles d1 and d2. We use d = min(d1, d2) as the new duty cycle, since di is the maximum duty cycle allowed for each session. Without loss of generality, we assume d = d2. We then use b′1 = d·r1 ≤ b1 as the new batch size for executing session S1. Note that the worst-case latency of requests in session S1 now becomes d + ℓk1(b′1) ≤ d1 + ℓk1(b1) ≤ L1, so we do not violate the latency constraint for S1 by this adjustment. If ℓk1(b′1) + ℓk2(b2) ≤ d and memory capacity permits, a single node can handle the computation of both S1 and S2, and we allocate these two sessions to the same node. While the above discussion considers merging two sessions, the underlying principle generalizes to the situation where a session is merged with a set of sessions executing on a given node.
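The MERGENODES test of Figure 9 can be sketched as follows. Each fractional node carries the sessions already packed into its duty cycle; shrinking the duty cycle to the smaller of the two keeps every session within its SLO, so the remaining check is whether the re-estimated batch latencies still fit. Latency at intermediate batch sizes is linearly interpolated from the profile here, which is our assumption rather than Nexus's exact estimator, and memory capacity is ignored.

def interp_latency(profile, b):
    """Estimate batch execution latency at batch size b from measured profile points."""
    sizes = sorted(profile)
    if b <= sizes[0]:
        return profile[sizes[0]][0] * b / sizes[0]
    for lo, hi in zip(sizes, sizes[1:]):
        if b <= hi:
            frac = (b - lo) / (hi - lo)
            return profile[lo][0] + frac * (profile[hi][0] - profile[lo][0])
    return profile[sizes[-1]][0] * b / sizes[-1]

def try_merge(node_a, node_b):
    """Merge two fractional nodes if all their sessions fit in the smaller duty cycle.

    A node is (duty_cycle_ms, [(profile, rate_rps), ...]); returns the merged node or None."""
    duty = min(node_a[0], node_b[0])        # smaller duty cycle never worsens worst-case latency
    sessions = node_a[1] + node_b[1]
    total = sum(interp_latency(p, duty * r / 1000.0) for p, r in sessions)
    return (duty, sessions) if total <= duty else None

profile_A = {4: (50, 80), 8: (75, 107), 16: (100, 160)}
profile_B = {4: (50, 80), 8: (90, 89), 16: (125, 128)}
merged = try_merge((125.0, [(profile_A, 64)]), (125.0, [(profile_B, 32)]))
print(merged is not None)                   # True: 75 ms + 50 ms fits in a 125 ms duty cycle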

Finally, we extend the algorithm to be incremental across epochs, thus minimizing the movement of models across nodes. If the overall workload decreases, the scheduler attempts to move sessions from the least utilized backends to other backends. If a backend no longer executes any session, the scheduler reclaims this backend and relinquishes it to the cluster manager. If the workload increases such that a backend becomes overloaded, we evict the cheapest sessions on this backend until it is no longer overloaded. We then perform bin packing again to place these evicted sessions on other backends.

6.2 Scheduling Complex Queries

We now present the query analysis algorithm that operates on dataflow representations of application queries in order to determine the latency SLO splits for the constituent models. The output of this analysis is given as input to the scheduling algorithm of Section 6.1, which works with individual models.

The query analysis algorithm extracts the dataflow representations from the application code (e.g., the one depicted in Figure 10 corresponding to Figure 8) and then computes the stage number of each constituent model as the maximum depth of the model invocation with respect to the input.


Figure 10: Dataflow representations of two example applications: (a) Game (input → LeNet and ResNet → output) and (b) Traffic (input → SSD → face and car recognition → output).

For example, in Figure 10(a), both the LeNet and ResNet models have a stage number of 1, and in Figure 10(b), the SSD model is at stage 1 while the face and car recognition models are at stage 2. Models at the same stage share the same latency SLO.

In addition, we also collect profiling information from the runtime execution of the query. For each edge Mi → Mj in the dependency graph, the frontends collect information regarding how many instances of Mj are invoked by a single Mi instance. This information is reported to the global scheduler, which aggregates it.

We now define the optimization objective for the analysis. Suppose there are k models in the query, Mi (1 ≤ i ≤ k), and that Mi is associated with a latency SLO of Li, with models at the same stage having the same SLO. Further, for a dependency edge Mi → Mj, let αij indicate the number of times Mj is invoked by an instance of Mi. Also, let Ni be the number of GPUs allocated for model Mi. Then, we have the constraint that αij·Ni·Ti(Li) = Nj·Tj(Lj), where Ti is the maximum throughput of model Mi given the latency SLO Li. Given these constraints and the target throughput associated with the top-level model, we can compute the GPU allocations for a latency split plan. We then choose the latency split plan that minimizes ∑Ni. For queries with only a few stages, the analysis tool just uses brute force to scan all possible splits and returns the best; otherwise, it uses simulated annealing to search for the best latency SLO split plan within a time limit.
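For a two-stage query like Figure 5, the brute-force search reduces to enumerating budget pairs and picking the one that needs the fewest GPUs. A minimal sketch using the Table 3 profiles; the target rate of 2000 reqs/s is an assumption for illustration:

from itertools import product
from math import ceil

T_X = {40: 200, 50: 250, 60: 300}   # latency budget (ms) -> max per-GPU throughput (reqs/s)
T_Y = {40: 300, 50: 400, 60: 500}

def best_split(total_budget_ms, alpha, target_rps):
    """Choose latency budgets (Lx, Ly) minimizing total GPUs Nx + Ny for the target rate."""
    best = None
    for lx, ly in product(T_X, T_Y):
        if lx + ly > total_budget_ms:
            continue
        n_x = ceil(target_rps / T_X[lx])          # GPUs so stage X sustains the target rate
        n_y = ceil(alpha * target_rps / T_Y[ly])  # stage Y sees alpha times that rate
        if best is None or n_x + n_y < best[0]:
            best = (n_x + n_y, lx, ly)
    return best

print(best_split(100, alpha=1, target_rps=2000))  # (13, 50, 50): plan (b) wins for alpha = 1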

7 Evaluation

We implemented Nexus in C++ with 10k lines of code. Nexus supports the execution of models trained by various frameworks, including Caffe [19], Caffe2 [10], Tensorflow [3], and Darknet [27]. Nexus can be deployed in a cluster using Docker Swarm [9] or Kubernetes [13].

To evaluate Nexus, we first use two case studies of real-world applications to compare the end-to-end performance against Clipper and Tensorflow Serving (denoted as TF serving below). We then perform a set of experiments to quantify the performance gains of each optimization feature in Nexus. The experiments are performed on a cluster of 16 NVIDIA GTX 1080Ti GPUs.

7.1 Methodology

For any system and any given workload, we measure the maximum request processing rate such that 99% of the requests are served within their latency SLOs, and refer to this measurement as the system's maximum throughput. This metric reflects the GPU efficiency that a system can achieve on a given workload.
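For reference, the underlying "good rate" is just the fraction of requests finished within the SLO; a trivial sketch with made-up latencies:

def good_rate(latencies_ms, slo_ms):
    """Fraction of requests that finished within the latency SLO."""
    return sum(l <= slo_ms for l in latencies_ms) / len(latencies_ms)

print(good_rate([42, 51, 38, 205, 47], slo_ms=100))   # 0.8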

Note that neither Clipper nor TF serving is a cluster-wide solution. We provide a batch-oblivious scheduler as a baseline for both Clipper and TF serving. It works as follows. We first profile the maximum throughput achieved by Clipper or TF serving on a single node for a given model and latency SLO. GPU shares for each model session are calculated to be proportional to its request rate and inversely proportional to the maximum model throughput measured in the previous step. We then use a greedy algorithm to allocate models to GPUs according to their GPU shares. For complex queries, since Clipper and TF serving do not provide functionality for analyzing query performance, we evenly split the latency SLO across the different stages and use this latency split plan in our profiling step. We also use the following node-level configurations to instantiate Clipper and TF serving.

• Clipper: Clipper encapsulates each model in a docker container. If the batch-oblivious resource allocator assigns multiple models to a single GPU, we just launch multiple docker containers on this GPU. The Clipper frontend provides a load balancing mechanism; therefore, we rely on Clipper to load balance requests across multiple replicas of a model.

• TF serving: We cannot specify a latency SLO in TF serving. Instead, we pick a maximum batch size for each model based on its profile so that TF serving doesn't violate its latency SLO. TF serving also doesn't provide a frontend to load balance requests across replicas. We therefore built a frontend for TF serving and dispatched requests to backends according to the mapping generated by the resource allocator.

7.2 Case Study

7.2.1 Live Game Analysis

In this case study, we evaluate an application that analyzes frames from live game streams. On each frame, we need to recognize 6 digits and one object icon. We invoke LeNet [21] 6 times to recognize the digits and ResNet-50 [15] once to recognize the icon. ResNet is specialized to each game by re-training the last layer of the model, while all games share the same LeNet. We include 20 games in the case study and consider a latency SLO of 50ms. The request rates of frames from the 20 games follow a Zipf-0.9 distribution. We perform this experiment on 16 GPUs. During the experiment, we noticed that both Clipper and TF serving perform poorly when the query invokes the tiny LeNet model, as the scheduler allocates too many GPUs for LeNet. Therefore, we also evaluated Clipper and TF serving on a modified game application that does not invoke LeNet but rather uses all 16 GPUs for ResNet.


Figure 11: Throughput comparison of Clipper, TF serving, and Nexus for 20 live game analysis applications on 16 GPUs (with and without LeNet). "Nexus batch oblivious" measures the throughput of Nexus when using the batch-oblivious scheduler; "Nexus no prefix batch" disables prefix batching in Nexus.

Figure 12: Throughput of Clipper, TF serving, and Nexus for the traffic analysis application on 16 GPUs, for day and night feeds.


Figure 11 depicts the end-to-end throughput of Clipper, TF serving, and Nexus. We include two variants of Nexus to tease out the benefits obtained from some of our techniques: (a) a variant of Nexus that uses the batch-oblivious scheduler to perform resource allocation, and (b) a variant of Nexus that disables prefix batching. Overall, Nexus achieves 21× and 41× the throughput of TF serving and Clipper, respectively, on the original game application. Moreover, TF serving and Clipper still process 3-4× fewer requests than Nexus even after we remove LeNet from the application. Comparing across the Nexus variants, our global scheduler achieves 27% higher throughput than the batch-oblivious scheduler, and prefix batching bumps up throughput by an extra 16%.

7.2.2 Traffic Monitoring

The traffic monitoring application analyzes live video streams from many traffic cameras. It first detects objects using SSD [7] and recognizes the make and model of cars and faces using GoogleNet-car [32] and VGG-Face [25], respectively. Figure 8 shows the query code for the traffic app. The latency SLO of this query is 400ms. We measure the throughput for this application using traffic feeds obtained during both day time and night time. Figure 12 shows that Nexus achieves 1.8× (day) and 2.9× (night) the throughput of TF serving, and 4.4× (day) and 2.2× (night) the throughput of Clipper. Nexus is able to adapt the latency split for the SSD model from 302ms during the day to 345ms at night because fewer cars and pedestrians appear in the frame. Therefore, we see a greater increase in Nexus's throughput at night than in TF serving's.

7.3 Microbenchmark

We perform several microbenchmarks to evaluate the throughput improvement from each feature used in Nexus.

7.3.1 Single GPU performance

On a single GPU, we compare the performance of Nexus, Clipper, and Tensorflow Serving when supporting multiple models on a single GPU. In addition, Nexus performs prefix batching when it detects that different DNN models share a common prefix, to further improve throughput.

GPU Multiplexing. Figure 13 compares the throughput when executing multiple models on a single GPU. Three factors affect the performance of GPU multiplexing: the number of models that share the GPU, the latency SLO of the models, and the model architecture. By default, we use the Inception [31] model, a latency SLO of 100ms, and deploy 3 models on the same GPU. We include one variant of Nexus in this experiment, denoted as Nexus-parallel, which executes models in parallel to quantify how much GPU interference affects throughput. Figure 13(a) compares the throughput for 2 to 5 models on one GPU. The results show that Nexus achieves 10%-30% higher throughput than Nexus-parallel. Interference becomes more severe when more models contend on the same GPU. Figure 13(b) compares the throughput while varying the latency SLO from 50ms to 200ms. When the latency SLO becomes higher, Nexus-parallel tends to achieve relatively higher throughput because there is more slack to tolerate interference. We also repeat the experiment for other model architectures, including SqueezeNet [18], ResNet-50 [15], and VGG-16 [30] (shown in Figure 13(c)). We observe that larger models tend to suffer more from interference.

Nexus achieves 1.4–2.1× the throughput of TF serving, and 1.9–9.8× the throughput of Clipper, on a single GPU. Because Clipper executes each model in a Docker container and there is no coordination mechanism among containers, Clipper suffers from interference between models. We can see that Clipper's throughput drops significantly when more models are allocated on the same GPU or when the latency SLO becomes more constrained. TF serving executes multiple models in a round-robin way, the same as Nexus, but TF serving doesn't overlap the pre- and post-processing on the CPU with model execution on the GPU. TF serving also waits for a target batch size to be filled, up to a timeout. Both of these cause the GPU to idle and correspondingly harm throughput.
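The contrast just described can be illustrated with a minimal executor loop: one CPU thread keeps preprocessing requests while the GPU loop visits models in round-robin order and executes whatever has queued up instead of stalling for a full batch. This is a simplified sketch under assumed preprocess/gpu_execute callables, not Nexus's backend implementation.

import queue, threading

# Minimal sketch (not Nexus's actual backend): one CPU thread preprocesses
# incoming requests while the GPU loop round-robins over models, executing
# whatever has accumulated (up to max_batch) instead of stalling for a full batch.
class Executor:
    def __init__(self, models, preprocess, gpu_execute, max_batch=32):
        self.models = models                      # model handles, e.g. names
        self.preprocess = preprocess              # CPU-side transform (hypothetical)
        self.gpu_execute = gpu_execute            # batched GPU inference (hypothetical)
        self.max_batch = max_batch
        self.ready = {m: queue.Queue() for m in models}
        self.incoming = queue.Queue()
        threading.Thread(target=self._prep_loop, daemon=True).start()

    def _prep_loop(self):
        # Runs concurrently with GPU execution, so preprocessing never idles the GPU.
        while True:
            model, request = self.incoming.get()
            self.ready[model].put(self.preprocess(request))

    def run_once(self):
        # One duty cycle: visit each model in round-robin order.
        for model in self.models:
            batch = []
            while len(batch) < self.max_batch and not self.ready[model].empty():
                batch.append(self.ready[model].get_nowait())
            if batch:                             # execute early; don't wait for a full batch
                self.gpu_execute(model, batch)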



Figure 13: Throughput with multiple models on a single GPU for Clipper, Tensorflow Serving, a Nexus variant that executes models in parallel (Nexus-parallel), and standard Nexus, which executes models in a round-robin manner. (a) compares the throughput of Inception models for different numbers of models on one GPU with a latency SLO of 100ms. (b) compares the throughput of Inception models under different latency SLOs with 3 models on one GPU. (c) compares the throughput of different model architectures with a latency SLO of 100ms and 3 models on one GPU.

Figure 14: Throughput improvement from batching the common prefix of model variants, compared to running these models separately on one GPU. (a) and (b) both use the Inception model. (a) compares the throughput improvement for different numbers of models that share a common prefix with a latency SLO of 100ms, whereas (b) compares the throughput improvement under different latency SLOs when 3 models share a prefix. (c) shows the throughput improvement for different model architectures under a 100ms latency SLO for 3, 5, and 10 models that share a prefix.

Prefix Batching. Figure 14 evaluates prefix batching against executing the models separately on one GPU. Figure 14(a) depicts the throughput improvement when varying the number of Inception model variants that differ only in the last layer, with a latency SLO of 100ms. It shows that prefix batching improves throughput by up to 125%. Prefix batching primarily benefits from using a larger batch size for the common prefix; otherwise, models are executed at much smaller batch sizes when more models are multiplexed onto a GPU. Similarly, when the latency SLO becomes smaller, prefix batching provides larger throughput gains, as shown in Figure 14(b). Figure 14(c) applies prefix batching to different model architectures when there are 3, 5, and 10 models. Prefix batching doesn't improve the throughput for SqueezeNet when there are 3 or 5 model variants because it is a much smaller model, and the variants are able to saturate the GPU even without prefix batching. For the other settings, prefix batching improves throughput by up to 3×.
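Concretely, prefix batching runs the shared prefix once over the concatenated inputs of all variants at a large batch size, then runs each specialized suffix on its own slice of the resulting features. The sketch below illustrates this under assumed prefix and suffix callables; it is not the Nexus implementation.

import numpy as np

# Simplified illustration of prefix batching (not Nexus's implementation):
# `prefix` is the shared sub-network and `suffixes[i]` is the specialized tail
# of model variant i; both are assumed to map a batch array to a batch array.
def prefix_batched_execute(prefix, suffixes, per_model_inputs):
    sizes = [len(x) for x in per_model_inputs]
    # Run the common prefix once on the concatenated inputs at a large batch size.
    features = prefix(np.concatenate(per_model_inputs, axis=0))
    # Split the shared features and run each (smaller) suffix sequentially.
    outputs, offset = [], 0
    for suffix, n in zip(suffixes, sizes):
        outputs.append(suffix(features[offset:offset + n]))
        offset += n
    return outputs

Only one copy of the prefix's weights needs to be resident, which is also the source of the memory savings discussed below.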

Note that, as prefix batching executes the different suffixes of the model variants sequentially, it introduces certain overheads if the suffix includes heavyweight computations. Figure 15(a) shows the throughput of prefix-batched models when varying the length of the suffix from the last layer to the last three fully-connected layers and varying the number of models. When the suffix contains no more than two fully-connected layers, execution of the suffixes imposes less than 5% overhead even with 10 models. But, because the third-from-last layer is quite heavyweight, prefix batching only achieves 72% of the throughput for 10 models compared to no prefix batching. Another benefit brought by prefix batching is memory savings, since we only need to allocate one copy of the common prefix. Figure 15(b) shows that the memory use of prefix batching grows sub-linearly with the number of models, whereas the GPU runs out of memory at 8 models without prefix batching.



Figure 15: Overhead and memory use of prefix batching for VGG-16 with varying suffix lengths. k FC means that the suffix contains the last k fully-connected layers. (a) Throughput of prefix batching for different numbers of models and suffix lengths. (b) Memory footprint of prefix batching with respect to the number of models and suffix length. The black line shows the memory use without prefix batching.

Figure 16: Comparison of the throughput of Nexus and of Nexus using a batch-oblivious resource allocator.

7.3.2 Resource Allocation on Multiple GPUs

We now compare the global scheduler in Nexus against the batch-oblivious scheduler used for Clipper and TF serving. We measure the throughput of standard Nexus and of Nexus using the batch-oblivious scheduler as the baseline. Both need to allocate 16 sessions on 8 GPUs under 5 scenarios (scenarios (a) and (b) are run separately for Inception and for ResNet): (a) 16 Inception or ResNet models with mixed SLOs ranging from 50ms to 200ms, (b) 16 Inception or ResNet models with mixed request rates following a Zipf-0.9 distribution, and (c) 8 different model architectures, each associated with two SLOs, 50ms and 100ms. Figure 16 depicts the throughput of standard Nexus relative to the baseline. Nexus outperforms the baseline by 11–64% because our global scheduler is aware of the "squishy" performance of batched execution, whereas the batch-oblivious scheduler is prone to overloading a backend by placing excessive workload on it.
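The following sketch illustrates, under our own simplified assumptions rather than the actual Nexus scheduler, why batch-awareness matters: before co-locating sessions on a GPU, a batch-aware allocator checks whether some duty cycle exists at which every session's batch both fits in the cycle and meets its SLO, whereas a batch-oblivious allocator only compares summed request rates against a fixed capacity. The batch_latency(model, b) profile is a hypothetical lookup of batched execution time.

import math

# Hedged illustration of a batch-aware feasibility check (not the actual Nexus
# scheduler). Each session has a request rate (reqs/s) and a latency SLO (s);
# batch_latency(model, b) is an assumed profile of batched execution time in seconds.
def feasible(sessions, duty_cycle, batch_latency):
    """Can these sessions share one GPU with the given duty cycle (seconds)?"""
    total_exec = 0.0
    for model, rate, slo in sessions:
        batch = max(1, math.ceil(rate * duty_cycle))   # requests arriving per cycle
        exec_time = batch_latency(model, batch)
        # A request may wait up to one duty cycle before its batch starts.
        if duty_cycle + exec_time > slo:
            return False
        total_exec += exec_time
    # All batches must complete within one duty cycle to sustain the rates.
    return total_exec <= duty_cycle

def max_feasible_duty_cycle(sessions, batch_latency, candidates):
    # A batch-oblivious allocator would only compare summed rates against a
    # capacity number; here we search candidate duty cycles for the largest
    # feasible one, which maximizes batch sizes and hence GPU efficiency.
    best = None
    for d in sorted(candidates):
        if feasible(sessions, d, batch_latency):
            best = d
    return best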

7.3.3 Complex Query

To evaluate the performance gain of the query analyzer, we compare the throughput of Nexus with and without the query analyzer. The baseline simply splits the latency SLO evenly across the various stages in the query.

Figure 17: Comparison of the throughput of Nexus with and without the query analyzer. The query in this experiment includes two models, SSD and Inception. We vary the latency SLO and α in the query, where α indicates the number of times Inception is invoked per SSD invocation.

Figure 18: Changing workload of 10 applications over time on 64 GPUs. The three panels plot the aggregate workload (reqs/s), the number of GPUs, and the good rate over time.

The query includes two stages: (a) the first stage executes SSD, and then (b) invokes the Inception model α times. The experiment is performed on 8 GPUs. We vary the latency SLO from 300ms to 500ms, and choose α to be 0.1, 1, and 10. Figure 17 shows that Nexus with the query analyzer achieves 13–55% higher throughput than the baseline.
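A minimal version of the latency-split search performed by the query analyzer might look like the sketch below (our illustration, not the actual Nexus analyzer): given a hypothetical throughput(model, budget) profile, it picks the split of the query SLO that maximizes end-to-end throughput, discounting the downstream stage by the α invocations it receives per upstream request.

# Illustrative sketch of splitting a query SLO across two stages (not the actual
# Nexus query analyzer). throughput(model, budget) is a hypothetical profile of
# the sustainable request rate when a stage receives `budget` seconds of the SLO.
def best_latency_split(slo, alpha, throughput, stages=("ssd", "inception"), step=0.01):
    first, second = stages
    best_split, best_rate = None, 0.0
    budget = step
    while budget < slo:
        # End-to-end throughput is limited by the slower stage; the second stage
        # must absorb alpha invocations per request to the first stage.
        rate = min(throughput(first, budget),
                   throughput(second, slo - budget) / max(alpha, 1e-9))
        if rate > best_rate:
            best_split, best_rate = (budget, slo - budget), rate
        budget += step
    return best_split, best_rate

An even split, as in the baseline, corresponds to fixing budget = slo / 2 regardless of α, which is why its advantage shrinks or grows with the relative cost of the two stages.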

7.4 Large scale deployment with changing workload

We deploy Nexus on a cluster of 64 GPUs and run 10 applications with mixed models and latency SLOs for 15 minutes with a changing workload. Figure 18 demonstrates that Nexus is able to scale up when the workload increases, and to consolidate workload and recycle GPUs when the workload decreases, while maintaining a high good rate for all applications.

8 Conclusion

We proposed a scalable and efficient system design for serving Deep Neural Network (DNN) applications. Instead of serving the entire application in an opaque CPU-based container with models embedded in it, which leads to sub-optimal GPU utilization, our system operates directly on models and GPUs. This design enables several optimizations in batching and allows more efficient resource allocation. Our system is fully implemented in C++, and our evaluation shows that Nexus can achieve 1.4-41× more throughput relative to state-of-the-art baselines while staying within latency constraints (achieving a "good rate") >99% of the time.

References

[1] Twitch. https://www.twitch.tv/.

[2] Bin-Packing, pages 449–465. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008.

[3] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

[4] A. Abdelfattah, A. Haidar, S. Tomov, and J. Dongarra. Fast Cholesky factorization on GPUs for batch and native modes in MAGMA. Journal of Computational Science, 20:85–93, 2017.

[5] A. Adya, D. Myers, J. Howell, J. Elson, C. Meek, V. Khemani, S. Fulger, P. Gu, L. Bhuvanagiri, J. Hunter, et al. Slicer: Auto-sharding for datacenter applications. In OSDI, pages 739–753, 2016.

[6] D. Baylor, E. Breck, and H.-T. Cheng. TFX: A TensorFlow-based production-scale machine learning platform, 2017.

[7] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018.

[8] D. Crankshaw, X. Wang, G. Zhou, M. J. Franklin, J. E. Gonzalez, and I. Stoica. Clipper: A low-latency online prediction serving system. In NSDI, pages 613–627, 2017.

[9] Docker. Docker swarm. https://github.com/docker/swarm.

[10] Facebook. Caffe2. https://caffe2.ai/.

[11] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1979.

[12] Google. Cloud AutoML Vision. https://cloud.google.com/vision/automl/docs/.

[13] Google. Kubernetes: Production-grade container orchestration. https://kubernetes.io/.

[14] S. Han, H. Shen, M. Philipose, S. Agarwal, A. Wolman, and A. Krishnamurthy. MCDNN: An approximation-based execution framework for deep stream processing under resource constraints. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, pages 123–136. ACM, 2016.

[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[16] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. H. Katz, S. Shenker, and I. Stoica. Mesos: A platform for fine-grained resource sharing in the data center. In NSDI, volume 11, pages 22–22, 2011.

[17] K. Hsieh, G. Ananthanarayanan, P. Bodik, P. Bahl, M. Philipose, P. B. Gibbons, and O. Mutlu. Focus: Querying large video datasets with low latency and low cost. arXiv preprint arXiv:1801.03493, 2018.

[18] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.

[19] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM, 2014.

[20] D. Kang, J. Emmons, F. Abuzaid, P. Bailis, and M. Zaharia. NoScope: Optimizing neural network queries over video at scale. Proceedings of the VLDB Endowment, 10(11):1586–1597, 2017.

[21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[22] Microsoft. Custom Vision. https://azure.microsoft.com/en-us/services/cognitive-services/custom-vision-service/.

[23] Microsoft. Virtual machine scale sets. https://docs.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-overview.

[24] K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica. Sparrow: Distributed, low latency scheduling. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 69–84. ACM, 2013.



[25] O. M. Parkhi, A. Vedaldi, A. Zisserman, et al. Deep face recognition. In BMVC, volume 1, page 6, 2015.

[26] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.

[27] J. Redmon. Darknet: Open source neural networks in C. http://pjreddie.com/darknet/, 2013–2016.

[28] M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, and J. Wilkes. Omega: Flexible, scalable schedulers for large compute clusters. In Proceedings of the 8th ACM European Conference on Computer Systems, pages 351–364. ACM, 2013.

[29] H. Shen, S. Han, M. Philipose, and A. Krishnamurthy. Fast video classification via adaptive cascading of deep models. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[30] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[31] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.

[32] L. Yang, P. Luo, C. Change Loy, and X. Tang. A large-scale car dataset for fine-grained categorization and verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3973–3981, 2015.

[33] D. Yu, F. Seide, G. Li, and L. Deng. Exploiting sparseness in deep neural networks for large vocabulary speech recognition. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2012.

A Hardness of Fixed-rate GPU Scheduling Problem (FGSP)

We now justify the use of an approximate algorithm for GPU cluster scheduling. We define the Fixed-rate GPU Scheduling Problem (FGSP), which is a highly restricted version of the general problem, and we show that even the restricted version is intractable.

FGSP:
Input: models $M_i$, $1 \le i \le n$, with corresponding latencies $L_i$, latency bounds $B_i$, and GPU count $C$. (The latencies correspond to the fixed rates.)
Output: a partition of the models into $C$ sets such that in each set $S$ we have $D + L_i \le B_i$ for all $i \in S$, where $D = \sum_{i \in S} L_i$ is the duty cycle for the set.

We show that FGSP is strongly NP-hard by reduction from 3-PARTITION [11].

Theorem 1. FGSP is strongly NP-complete.

Proof. We start with a given instance of 3-PARTITION, which consists of a bound $B$ and $3n$ numbers $B/4 \le a_1, a_2, \ldots, a_{3n} \le B/2$; the goal of 3-PARTITION is to partition the $a_i$ into triples such that the sum of each triple is $B$. Observe that w.l.o.g. we may assume that $\sum_{1 \le i \le 3n} a_i = nB$.

From the given instance of 3-PARTITION we create an instance of FGSP by setting $L_i = 2B + a_i$ and $B_i = 9B + a_i$ for all $1 \le i \le 3n$, and $C = n$.

It is clear that if there exists a solution to the 3-PARTITION instance, then the same partition into $n$ triples yields a partition of the FGSP instance into $C = n$ sets such that $D + L_i \le 9B + a_i$, since $D = 7B$ and $L_i = 2B + a_i$. In the other direction, suppose there exists a solution to FGSP. Observe that in any solution to FGSP every set can have at most 3 models, because otherwise the duty cycle $D$ would exceed $8B$ and the constraint $D + L_i \le B_i$ would be violated for any $i$ in the set, since $D + L_i > 10B$ but $B_i < 10B$. Since there are a total of $3n$ models and $C = n$ sets, every set must have exactly 3 models, i.e., every set must be a triple. Since $D + L_i \le B_i$ for any $i$ in the set, we have $D + 2B + a_i \le 9B + a_i$, or $D \le 7B$. But this implies that in every triple the sum of the $L_i$ is at most $7B$, or equivalently the sum of the corresponding $a_i$ is at most $B$. Since the sum over all $n$ triples is $nB$ and each triple sums to at most $B$, the sum of each triple must be exactly $B$. This means that the partition of the models of the FGSP instance into sets also yields a solution for the partition of the corresponding $a_i$ into triples in 3-PARTITION.
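As a quick sanity check of the construction (our own illustration, not part of the original proof), take $B = 12$ and a single triple $(a_1, a_2, a_3) = (3, 4, 5)$, which satisfies $B/4 \le a_i \le B/2$ and sums to $B$:

\begin{align*}
L_i &= 2B + a_i = (27,\ 28,\ 29), & B_i &= 9B + a_i = (111,\ 112,\ 113), \\
D &= 27 + 28 + 29 = 84 = 7B, & D + L_i &= (111,\ 112,\ 113) \le B_i,
\end{align*}

so the FGSP constraints $D + L_i \le B_i$ hold, and they are tight precisely because the triple sums to $B$.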
