
Henge: Intent-driven Multi-Tenant Stream Processing

Faria Kalim
University of Illinois at Urbana-Champaign
[email protected]

Le Xu
University of Illinois at Urbana-Champaign
[email protected]

Sharanya Bathey
Amazon.com Services, Inc.
[email protected]

Richa Meherwal
phData, Inc.
[email protected]

Indranil Gupta
University of Illinois at Urbana-Champaign
[email protected]

ABSTRACT
We present Henge, a system to support intent-based multi-tenancy in modern distributed stream processing systems. Henge supports multi-tenancy as a first-class citizen: everyone in an organization can now submit their stream processing jobs to a single, shared, consolidated cluster. Secondly, Henge allows each job to specify its own intents (i.e., requirements) as a Service Level Objective (SLO) that captures latency and/or throughput needs. In such an intent-driven multi-tenant cluster, the Henge scheduler adapts continually to meet jobs' respective SLOs in spite of limited cluster resources, and under dynamically varying workloads. SLOs are soft and are based on utility functions. Henge's overall goal is to maximize the total system utility achieved by all jobs in the system. Henge is integrated into Apache Storm and we present experimental results using both production jobs from Yahoo! and real datasets from Twitter.

CCS CONCEPTS
• Computer systems organization → Distributed architectures;

KEYWORDS
Stream Processing, Multi-Tenancy, Resource Management, Intents, Service Level Objectives

ACM Reference Format:
Faria Kalim, Le Xu, Sharanya Bathey, Richa Meherwal, and Indranil Gupta. 2018. Henge: Intent-driven Multi-Tenant Stream Processing. In Proceedings of SoCC '18: ACM Symposium on Cloud Computing, Carlsbad, CA, USA, October 11–13, 2018 (SoCC '18), 14 pages. https://doi.org/10.1145/3267809.3267832

1 INTRODUCTION
Modern distributed stream processing systems use a cluster to process continuously-arriving data streams in real time, from Web data to social network streams. Multiple companies use Apache Storm [4]

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SoCC '18, October 11–13, 2018, Carlsbad, CA, USA
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-6011-1/18/10...$15.00
https://doi.org/10.1145/3267809.3267832

(Yahoo!/Oath, Weather Channel, Alibaba, Baidu, WebMD, etc.), Twitter uses Heron [54], LinkedIn relies on Samza [3], and others use Apache Flink [1]. These systems provide high-throughput and low-latency processing of streaming data from advertisement pipelines (internal use at Yahoo!), social network posts (LinkedIn, Twitter), geospatial data (Twitter), etc.

While stream processing systems for clusters have been around for decades [18, 35], neither classical nor modern distributed stream processing systems support intent-driven multi-tenancy. We tease apart these two terms. First, multi-tenancy allows multiple jobs to share a single consolidated cluster. Because this capability is lacking in stream processing systems today, many companies (e.g., Yahoo! [6]) over-provision the stream processing cluster and then physically apportion it among tenants (often based on team priority). Besides higher cost, this entails manual administration of multiple clusters, caps on allocation by the sysadmin, and manual monitoring of job behavior by each deployer.

Multi-tenancy is attractive as it reduces acquisition costs and allows sysadmins to manage only a single consolidated cluster. In industry terms, multi-tenancy reduces capital and operational expenses (Capex & Opex), lowers total cost of ownership (TCO), increases resource utilization, and allows jobs to elastically scale based on needs. Multi-tenancy has been explored for areas such as key-value stores [73], storage systems [81], batch processing [80], and others [58], yet it remains a vital need in modern stream processing systems.

Secondly, we believe the deployer of each job should be able to clearly specify their performance expectations as an intent to the system, and it is the underlying engine's responsibility to meet this intent. This alleviates the developer's burden of continually monitoring their job and guessing how many resources it needs. Modern distributed stream processing systems are very primitive and do not admit intents.

We allow each job in a multi-tenant environment to specify its own intent as a Service Level Objective (SLO) [44]. It is critical that the metrics in an SLO be user-facing, i.e., understandable and settable by lay users such as a deployer who is not intimately familiar with the innards of the system and scheduler. For instance, SLO metrics can capture latency and throughput expectations. SLOs should not include internal metrics like queue lengths or CPU utilization, as these can vary depending on the software, cluster, and job mix¹.

¹ However, these latter metrics can be monitored and used internally by the scheduler for self-adaptation.


Business            | Use Case                                                           | SLO Type & Value
Bloomberg           | High-frequency trading                                             | Latency < tens of ms
                    | Updating top-k recent news articles on website                     | Latency < 1 min
                    | Combining updates into one email sent per subscriber               | Throughput > 40K messages/s [64]
Uber                | Determining price of a ride on the fly, identifying surge periods  | Latency < 5 s
                    | Analyzing earnings over time                                       | Throughput > 10K rides/hr [16]
The Weather Channel | Monitoring natural disasters in real-time                          | Latency < 30 s
                    | Processing collected data for forecasts                            | Throughput > 100K messages/min [10]
WebMD               | Monitoring blogs to provide real-time updates                      | Latency < 10 min
                    | Search indexing related websites                                   | Throughput: index new sites at the rate found
E-Commerce Websites | Counting ad-clicks                                                 | Latency: update click counts every second
                    | Processing logs at Alipay                                          | Throughput > 6 TB/day [15]

Table 1: Stream Processing: Use Cases and Possible SLO Types.

Scheduler      | Jobs      | Multi-Tenant? | User-Facing SLOs?
Mesos [41]     | General   | ✓             | ✗ (uses reservations: CPU, Mem, Disk, Ports)
YARN [80]      | General   | ✓             | ✗ (uses reservations: CPU, Mem, Disk)
Aurora [17]    | Streaming | ✓             | ✓ (for single-node environments only)
Borealis [18]  | Streaming | ✗             | ✓ (latency, throughput, others)
R-Storm [66]   | Streaming | ✗             | ✗ (schedules based on CPU, Mem, Bandwidth)
Henge          | Streaming | ✓             | ✓ (latency, throughput, hybrid)

Table 2: Henge vs. Existing Multi-Tenant Schedulers.

We believe lay users should not have to grapple with such complex metrics.

While there are myriad ways to specify SLOs (including potentially declarative languages paralleling SQL), our paper is best seen as one contributing mechanisms that are pivotal for building a truly intent-based stream processing system. Our latency SLOs and throughput SLOs are immediately useful. Time-sensitive jobs (e.g., those related to an ongoing ad campaign) are latency-sensitive and will specify latency SLOs, while longer-running jobs (e.g., sentiment analysis of trending topics) will have throughput SLOs. Table 1 summarizes several real use cases spanning different SLO requirements.

We present Henge, the first scheduler to support both multi-tenancy and per-job intents (SLOs) for modern stream processing engines. Henge has to tackle several challenges. In a cluster of limited resources, Henge needs to continually adapt to meet jobs' SLOs, both under natural system fluctuations and under input rate changes due to diurnal patterns or sudden spikes. While attempting to maximize the system's overall SLO achievement, Henge needs to prevent non-performing topologies from hogging cluster resources. It needs to scale with cluster size and number of jobs, and to work well under failures.

Table 2 compares Henge with multi-tenant schedulers that are generic (Mesos, YARN), as well as those that are stream processing-specific (Aurora, Borealis, and R-Storm). Generic schedulers largely use reservation-based approaches to specify intents. A reservation is an explicit request to hold a specified amount of cluster resources for a given duration [29]. Besides not being user-facing, reservations are known to be hard to estimate even for a job with a static workload [45], let alone the dynamic workloads prevalent in streaming applications. Classical stream processing systems are either limited to a single-node environment (Aurora), or lack multi-tenant

implementations (e.g., Borealis has a multi-tenant proposal, but no associated implementation). R-Storm [66] is a resource-aware variant of Storm that adapts jobs based on CPU, memory, and bandwidth, but does not support user-facing SLOs.

Fig. 1: Benefits of enabling Henge in Apache Storm.

Fig. 1 depicts the twin benefits from Henge: dollar cost savings from multi-tenancy, and higher SLO satisfaction. Concretely, in the workload considered (5 jobs), x=100% represents the minimum resources needed in a single-tenant version of Storm (each job gets its own cluster) to reach 100% utility, i.e., satisfy all SLOs. First, compared to single-tenant Storm, Henge can save 60% of resource costs (x=40%) and still reach 93.5% utility. It can satisfy 100% of SLOs at resource savings of 40% (x=60%). Second, compared to multi-tenant vanilla Storm, Henge achieves significantly higher SLO satisfaction (e.g., 6.7× better at x=40%). Considering that the streaming analytics market is expected to grow to $13.7 billion by 2021 [59], the 40%-60% dollar cost savings from Henge will have a significant impact on profit margins.

This paper makes the following contributions:

• We present Henge, the first scheduler for intent-based multi-tenancy in distributed stream processing.

• We define and calculate a new throughput SLO metric that is input-rate-independent, called "juice".

• We define SLOs via utility functions.

• We describe Henge’s implementation in Storm [4].


• We evaluate Henge using workloads from Yahoo! production Storm topologies, and Twitter datasets.

2 HENGE SUMMARY
We now summarize the novel ideas in Henge.

Juice: As input rates vary over time, specifying a throughput SLO as an absolute value is impractical. We define a new input-rate-independent metric for throughput SLOs called juice. We show how Henge calculates juice for arbitrary topologies (Section 6).

Juice lies in the interval [0, 1] and captures the ratio of processing rate to input rate: a value of 1.0 is ideal and implies that the rate of incoming tuples equals the rate of tuples processed by the job. Conversely, a value less than 1 indicates that tuples are building up in queues, waiting to be processed.

Throughput SLOs contain a minimum threshold for juice, making the SLO independent of input rate. We consider processing rate instead of output rate as this generalizes to cases where input tuples may be filtered or modified: thus, they affect results but are never outputted.

SLOs: A job's SLO can capture either latency or juice (or a combination of both). The SLO has: a) a threshold (min-juice or max-latency), and b) a job priority. Henge combines these via a user-specifiable utility function, inspired by soft real-time systems [52]. The utility function maps current achieved performance (latency or juice) to a value that represents the current benefit to the job. Thus, the function captures the developer intent that a job attains full "utility" if its SLO threshold is met, and partial utility if not. Our utility functions are monotonic: the closer the job is to its SLO threshold, the higher its achieved maximum possible utility (Section 4).

State Space Exploration: Moving resources in a live cluster is challenging. It entails a state space exploration where every step has both: 1) a significant realization cost, as moving resources takes time and affects jobs, and 2) a convergence cost, since the system needs time to converge to steady state after a step. Henge adopts a conservatively online approach where the next step is planned, executed in the system, then the system is allowed to converge, and the step's effect is evaluated. Then, the cycle repeats. This conservative exploration is a good match for modern stream processing clusters because they are unpredictable and dynamic. Offline exploration (e.g., simulated annealing) is time-consuming and may make decisions on a cluster using stale information (as the cluster has moved on). Conversely, an aggressive online approach will over-react to changes, and cause more problems than it solves.

The primary actions in our state machine (Section 5) are: 1) Reconfiguration (give resources to jobs missing their SLO), 2) Reduction (take resources away from overprovisioned jobs satisfying their SLO), and 3) Reversion (give up an exploration path and revert to a past good configuration). Henge gives jobs additional resources proportionate to how congested they are. Highly intrusive actions like reduction are kept small in number and frequency.

Maximizing System Utility: Via these actions, Henge attempts to continually improve each individual job and converge it to its maximum achievable utility. Henge is amenable to different goals for the cluster: either maximizing the minimum utility across jobs, or maximizing the total achieved utility across all jobs. While the former focuses on fairness, the latter allows more cost-effective use

of the cluster, which is especially useful since revenue is associated with the total utility of all jobs. Thus, Henge adopts the goal of maximizing total achieved utility summed across all jobs. Our approach creates a weak form of Pareto efficiency [83]; in a system where jobs compete for resources, Henge transfers resources among jobs only if this will cause the total cluster's utility to rise.

Preventing Resource Hogging: Topologies with stringent SLOs may try to take over all the resources of the cluster. To mitigate this, Henge prefers giving resources to topologies that: a) are farthest from their SLOs, and b) continue to show utility improvements due to recent Henge actions. This spreads resources across all wanting jobs and mitigates starvation and resource hogging.

3 BACKGROUND
We describe our system model, fitting modern stream processing engines like Storm [79], Heron [54], Flink [1], and Samza [3]. Stream processing jobs are long-running. A stream processing job can be logically interpreted as a topology, i.e., a directed acyclic graph of operators (called bolts in Apache Storm)². An operator is a logical processing unit that applies user-defined functions on a stream of tuples. Source operators (called spouts) pull input tuples from an external source (e.g., Kafka [53]), while sink operators spew output tuples (e.g., to affect dashboards). The sum of all spouts' input rates is the input rate of the topology, while the sum of all sinks' output rates is the output rate of the topology. Each operator is parallelized via multiple tasks. Fig. 2 shows an example topology.

² We use the terms job and topology interchangeably in this paper.

Fig. 2: Sample Storm topology. Tuples processed per unit time; edge labels indicate the number of tuples sent out by the parent operator to the child. The spout sends 10,000 tuples to Bolt A; Bolt A filters out 2,000 tuples and sends 8,000 tuples along each of its edges to Bolts B and C; Bolt C is congested and processes only 6,000 tuples per time unit; Bolts B and C feed Bolt D. (Congestion is described in Section 5.)

A topology is run on worker processes executing on each machine in the cluster. Each worker process runs executors (threads), which run tasks specific to one operator. Each thread processes streaming data, one tuple at a time, forwarding output tuples to its child operators. Tuples are shuffled among threads (our system is agnostic to this).

Definitions: An outputted tuple O's latency is the time between it being outputted at the sink and the arrival time of the latest input tuple that directly influenced it. A topology's latency is then the average latency of tuples it outputs over a window of time. A topology's throughput is the number of tuples processed per unit time.
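To make these definitions concrete, here is a minimal sketch (our own illustration, not code from the paper) of how windowed topology latency and throughput could be aggregated from sink-side measurements:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: aggregates sink-side samples into windowed metrics. */
class WindowedMetrics {
    private final List<Long> tupleLatenciesMs = new ArrayList<>();
    private long processedTuples = 0;

    /** Called at a sink: latencyMs = output time minus arrival time of the
        latest input tuple that directly influenced this output tuple. */
    void recordOutput(long latencyMs) {
        tupleLatenciesMs.add(latencyMs);
    }

    void recordProcessed(long count) {
        processedTuples += count;
    }

    /** Topology latency: average output-tuple latency over the window. */
    double topologyLatencyMs() {
        return tupleLatenciesMs.stream().mapToLong(Long::longValue).average().orElse(0.0);
    }

    /** Topology throughput: tuples processed per unit time. */
    double throughputPerSec(double windowSec) {
        return processedTuples / windowSec;
    }
}
```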

4 SLOS AND UTILITY FUNCTIONS
A Service Level Objective (SLO) [13] in Henge is user-facing and can be set without knowledge of internal cluster details. An SLO specifies: a) an SLO threshold (min-throughput or max-latency);


and b) a job priority. Henge girds these requirements together into a utility function.

Utility functions are commonplace in distributed systems [57, 70, 89, 90]. In Henge, each job can have its own utility function: the user can either reuse a stock utility function (like the knee function we define below) or define their own. Henge's utility function for a job maps the current performance of the job to a current utility value. Utility functions are attractive as they abstract away the type of SLO metric that each topology has. Our approach allows Henge to manage, in a compatible manner, a cluster with jobs of various SLO types, SLO thresholds, priorities, and utility functions.

Currently, Henge supports both latency SLOs and throughput SLOs (and hybrids thereof). (While Section 6 discusses juice as an input-rate-independent throughput metric, here we assume an arbitrary throughput metric.)

A utility function can also capture partial utility when the SLO requirement is not being met. Our only requirement for utility functions is that they must be monotonic, capturing more utility closer to the SLO threshold. For a job with a latency SLO, the utility function must be monotonically non-increasing as latency rises. For a job with a throughput SLO, it must be monotonically non-decreasing as throughput rises.

A utility function typically has a maximum utility, achieved when the SLO threshold is met; e.g., a job with an SLO threshold of 100 ms achieves maximum utility only if its latency is below 100 ms. To stay monotonic, as latency grows above 100 ms, the utility function can drop or plateau, but not rise.

The maximum utility value is based on (e.g., proportional to) job priority. For instance, in Fig. 3a, topology T2 has twice the priority of T1, and thus has twice the maximum utility (20 vs. 10). We assume that priorities are fixed at submission time, although Henge can generalize (Section 5.5).

Fig. 3: Knee Utility functions for: (a) throughput (or juice) SLOs, (b) latency SLOs. In (a), Priority(T1) = P and Priority(T2) = 2P, so T2's expected (maximum) utility is twice T1's.

Given these requirements, Henge allows a variety of utility functions: linear, piece-wise linear, step functions, lognormal, etc. Utility functions need not be continuous.

Users can pick any utility functions that are monotonic. For concreteness, our Henge implementation uses a piece-wise linear utility function called a knee function. A knee function (Fig. 3) has two parts: a plateau after the SLO threshold, and a sub-SLO part for when the job does not meet the threshold. Concretely, the achieved utilities for jobs with throughput and latency SLOs, respectively, are:

\[
\frac{\text{Current Utility}}{\text{Job Max Utility}} = \min\left(1,\ \frac{\text{Current Throughput Metric}}{\text{SLO Throughput Threshold}}\right) \qquad (1)
\]

\[
\frac{\text{Current Utility}}{\text{Job Max Utility}} = \min\left(1,\ \frac{\text{SLO Latency Threshold}}{\text{Current Latency}}\right) \qquad (2)
\]

The sub-SLO is the last term inside "min". For throughput SLOs, the sub-SLO is linear and arises from the origin point. For latency SLOs, the sub-SLO is hyperbolic (y ∝ 1/x), allowing increasingly smaller utilities as latencies rise. Fig. 3 shows an example of a throughput SLO (Fig. 3a) and a latency SLO (Fig. 3b).

A utility function approach can also be used in Service Level Agreements (SLAs), but these are beyond our scope.

5 HENGE STATE MACHINE
Henge uses a state machine for the entire cluster (Fig. 4). A cluster is Converged if and only if either: a) all topologies have reached their max utility (i.e., satisfy their SLO thresholds), or b) Henge recognizes that no further actions will improve any topology's performance, and thus it reverts to the last best configuration. The last clause guarantees that if external factors remain constant (topologies are not added and input rates stay constant), Henge will converge. All other cluster states are Not Converged.

To move between these two states, Henge uses three actions: Reconfiguration, Reduction, and Reversion.

Fig. 4: Henge's State Machine for the Cluster, with two states (Converged, Not Converged) and transitions labeled by the actions Reconfiguration, Reduction, and Reversion; the cluster is Not Converged while Total Current Utility < Total Max Utility.

5.1 Reconfiguration
In the Not Converged state, a Reconfiguration action gives resources to topologies missing their SLO. Reconfigurations occur in rounds, executed periodically (currently every 10 s). In each round, Henge sorts all topologies missing their SLOs in descending order of their maximum utility, with ties broken by preferring lower current utility. It then greedily picks the head of this sorted queue to allocate resources to. The performance of a chosen topology can be improved by giving more resources to all of its operators that are congested, i.e., those that have insufficient resources and are thus bottlenecks.

Measuring Congestion: We use operator capacity [14] to measure congestion. (Henge is amenable to using other congestion metrics, e.g., input queue sizes or ETP [87].) Operator capacity captures the fraction of time that an operator spends processing tuples in a time unit. Its values lie in the range [0.0, 1.0]. If an executor's capacity is near 1.0, then it is close to being congested.

Consider an executor E that runs several tasks of a topology operator. Its capacity is calculated as:

\[
\text{Capacity}_E = \frac{\text{Executed Tuples}_E \times \text{Execute Latency}_E}{\text{Unit Time}} \qquad (3)
\]

where Unit Time is a time window. The numerator multiplies the number of tuples executed in this window by their average execution latency, to calculate the total time spent executing those tuples.


Dividing the numerator by the time window tells us what fraction of time the executor was busy processing tuples. The operator capacity is then the maximum capacity across all of its executors. Henge considers an operator to be congested if its capacity is above the threshold Operator Capacity Threshold = 0.3. This low value increases the pool of possibilities, as more operators become candidates for receiving resources.

Allocating Thread Resources: Henge allocates each congested operator an additional number of threads proportional to its congestion level, based on the following equation:

\[
\left(\frac{\text{Current Operator Capacity}}{\text{Operator Capacity Threshold}} - 1\right) \times 10 \qquad (4)
\]

Reconfiguration is Conservative: Henge deploys a configuration change to a single topology on the cluster, and waits for the measured utilities to quiesce (this typically takes a minute in our configurations). No further actions are taken in the interim. It then measures the total cluster utility again; if utility improved, Henge continues its operations in further rounds, in the Not Converged state. If total utility reaches the maximum value (the sum of the maximum utilities of all topologies), then Henge continues monitoring the recently configured topologies for a while (4 subsequent rounds in our setting). If they remain stable, Henge moves to the Converged state.

A topology may improve only marginally after receiving more resources in a reconfiguration. If a job's utility increases by less than a factor of (1 + ∆), Henge retains the reconfiguration but skips this particular topology in near-future rounds (∆ = 5% in our implementation). Such a topology may have plateaued in terms of benefiting from more threads. As the cluster is dynamic, the topology's black-listing is allowed to expire after a while (1 hour in our settings), after which the topology again becomes a candidate for reconfiguration.

As reconfigurations are exploratory steps in the state space search, total system utility might decrease after a step. Henge employs two actions, called Reduction and Reversion, to handle such cases.

5.2 Reduction
If a Reconfiguration causes total system utility to drop, the next action is either a Reduction or a Reversion. Henge performs Reduction if and only if all of these conditions are true: (a) the cluster is congested (described below), (b) there is at least one SLO-satisfying topology, and (c) there is no past history of a Reduction action. Note that (c) implies that there is at most one Reduction cluster-wide (until an external factor such as the workload changes).

CPU load is defined as the number of processes that are running or runnable on a machine [7]. A machine's load should be less than or equal to its number of available cores, ensuring maximum utilization and no over-subscription. Thus, Henge considers a machine to be congested if its CPU load exceeds its number of cores. Henge considers a cluster to be congested when a majority of its machines are congested.

If a Reconfiguration drops utility and results in a congested cluster, Henge executes Reduction to reduce congestion. For all topologies meeting their SLOs, Henge finds all their un-congested operators (except spouts) and reduces their parallelism by a large amount (80% in our settings). If this results in further SLO misses, such topologies will be candidates in future reconfigurations to improve their SLO.

Henge limits Reduction to once per cluster to minimize intrusion; this limit is reset if external factors change (new jobs are added, input rates change, etc.).

Like backoff mechanisms [42], massive reduction is the only way to free many resources at once for reconfigurations. Fewer threads also make the system more efficient because they reduce the number of OS context switches.

Right after a reduction, if the next reconfiguration drops cluster utility again while keeping the cluster congested (measured using CPU load), Henge recognizes that performing another reduction is futile. This is a typical "lockout" case, and Henge resolves it by Reversion.

5.3 Reversion
If a Reconfiguration drops utility and a Reduction is not possible (meaning that at least one of the conditions (a)-(c) in Section 5.2 is false), Henge performs Reversion.

Henge sorts through its history of Reconfigurations and picks the one that maximized system utility. It moves the system back to this past configuration by resetting the resource allocations of all jobs to the values in that configuration, and moves to the Converged state. Here, Henge essentially concludes that it is impossible to further optimize cluster utility, given this workload. Henge maintains this configuration until changes like further SLO violations occur, which necessitate reconfigurations.

If a large enough drop (> 5%) in utility occurs in this Converged state (e.g., due to new jobs, or input rate changes), Henge infers that, as reconfigurations cannot be the cause of this drop, the workload of the topologies must have changed. As all past actions no longer apply to this new workload, Henge forgets all history of past actions and moves to the Not Converged state. In future reversions, such forgotten states will not be available. This reset allows Henge to start its state space search afresh.

5.4 Proof of Convergence
We prove the convergence of Henge. As is usually necessary in proofs about real implementations, we make some assumptions. Our implementation and subsequent evaluation in Section 8 remove all assumptions, and demonstrate that Henge works well in practice.

Recall that a converged state is one where Henge can take no further actions (Fig. 4). We define a well-behaved cluster as one where: (a) a Reduction action on a job (Section 5.2) does not cause its SLO to be violated (recall that reductions are only done on jobs already satisfying their SLO), and (b) between consecutive Henge steps, the performance of each topology (throughput and/or latency), measured via its utility, remains stable.

THEOREM 1. Given any initial state on a well-behaved cluster with no job arrivals or departures, and sufficiently long black-listing, Henge converges in finite steps.

PROOF. Consider a cluster with N jobs (topologies), J1 . . . JN. Henge ensures that in each step of its state machine, at least one topology is affected. This effect may be one of four kinds: (i) job Ji is reconfigured (Section 5.1) and its utility improves by at least a factor of (1 + ∆)³, or (ii) Ji is reconfigured and its utility improves by less than a factor of (1 + ∆), or (iii) the cluster is subjected to a


reduction (Section 5.2), or (iv) the cluster is subjected to a reversion (Section 5.3).

³ See the definition of ∆ in Section 5.1. In our implementation, ∆ = 5%.

Consider the set S of topologies not satisfying their SLO. We claim that the cardinality of S is monotonically non-increasing as long as steps under only cases (i)-(iii) are applied. We explain why. A case (i) step can only remove from (never add to) S, if and only if this step causes Ji to meet its SLO. A case (ii) step black-lists Ji (Section 5.1) and removes it from S; as we assumed black-listing lasts long enough, the monotonic property holds. Due to our well-behaved cluster assumption (a), the monotonic property holds after a reduction operation (case (iii)). Lastly, in between consecutive steps, the cardinality of S does not increase, due to our well-behaved cluster assumption (b).

Now consider case (i). For job i, let Init_Util(i) be its initial utility, and Max_Util(i) be its maximum achievable utility (at its SLO threshold). Let:

\[
F = \max_{i=1}^{N} \left\{ \frac{\text{Max\_Util}(i)}{\text{Init\_Util}(i)} \right\}
\]

The initial number of jobs missing their SLOs is ≤ N. For a topology Ji that converges to its SLO, Ji is subjected to at most \(\log_{(1+\Delta)}(F)\) reconfigurations under case (i) (until it meets its SLO). For a topology Ji that becomes black-listed, it is subject to strictly fewer than \(\log_{(1+\Delta)}(F)\) reconfigurations under case (i), along with an additional (last) reconfiguration under case (ii). Together, any job that either meets its SLO or becomes black-listed is subject to at most \(\log_{(1+\Delta)}(F)\) steps under cases (i) or (ii).

Cluster-wide, there is at most one case (iii) reduction step (Section 5.2). Finally, the reversion in case (iv) is, if present, always the last step in the entire cluster.

Putting these together, the total number of steps required for Henge to converge is ≤ \(N \cdot \log_{(1+\Delta)}(F) + 1 + 1\) steps. Thus, Henge converges in finite steps. □
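As an illustrative calculation (our own numbers, not from the paper): with \(N = 10\) jobs, \(\Delta = 5\%\), and \(F = 10\) (i.e., the neediest job must improve its utility tenfold), the bound is

\[
N \cdot \log_{(1+\Delta)}(F) + 2 = 10 \cdot \frac{\ln 10}{\ln 1.05} + 2 \approx 10 \times 47.2 + 2 \approx 474 \text{ steps.}
\]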

Clause (b) in the well-behaved cluster assumption can be relaxed and the proof (and theorem) restated, by amending F to also include the max of \(\frac{\text{Max\_Util}(i)}{\text{Red\_Util}(i)}\) over all jobs Ji whose SLO is violated due to a reduction, where Red_Util(i) is Ji's utility right after its reduction.

5.5 Discussion
Online vs. Offline State Space Search: Henge uses an online state space search. In fact, our early attempt at designing Henge was to perform offline state space exploration, by using analytical models to find relations between SLO metrics and job resource allocations.

However, the offline approach turned out to be impractical. Analysis and prediction are complex and inaccurate for stream processing systems, which are very dynamic in nature. (This phenomenon has also been observed in other distributed scheduling domains, e.g., [24, 45, 65].) Fig. 5 shows two identical runs of the same Storm job on 10 machines, where the job is given 28 extra threads at t=910 s. Latency drops to a lower value in run 2, but stays high in run 1. This is due to differing CPU resource consumption across the runs. Generally, we find that natural fluctuations occur commonly in an application's throughput and latency; left to itself, an application's performance changes and degrades gradually. This motivated us to adopt Henge's conservative online approach.

Fig. 5: Unpredictability in Modern Stream Processing Engines: Two runs of the same topology (on 10 machines), given the same extra computational resources (28 threads, i.e., executors) at 910 s, react differently.

Tail Latencies: Henge can also support SLOs expressed as tail latencies (e.g., 95th or 99th percentile). Utility functions are then expressed in terms of tail latency.

Statefulness, Memory Bottlenecks: In most production use cases, we have observed that operators are stateless and CPU-bound, and our exposition so far is thus focused. Even so, Henge gracefully handles stateful operators and memory-pressured nodes (Sections 8.3, 8.5). Stateful jobs can rebuild state after a reconfiguration, as we periodically checkpoint state to an external database. The orthogonal problem of dynamically reconfiguring stateful operators is beyond our scope and is an open direction [26].

"Magic" Parameters: Henge uses five parameters: the capacity threshold (Section 5.1), stability rounds (Section 5.1), improvement ∆ in utility for reconfiguration (Section 5.1), percentage reduction in executors (Section 5.2), and percentage utility drop to move out of the converged state (Section 5.3). All of these parameters only affect Henge's convergence speed, not its correctness.

Four of these five parameters have little effect on performance as long as their values stay small. The percentage reduction in executors may require minimal profiling. In our cluster, reducing a topology's executors by 90% caused SLO misses, while a 70% setting did not alleviate CPU load much. Thus, we set this parameter to 80% in our implementation, for all experimental workloads.

Henge as an Indicator of Resource Crunches: While we do not perform admission control, Henge can nonetheless be used to identify topologies that are overly hungry (e.g., high-priority jobs that are touched by many reconfiguration steps), and these could be black-listed or removed. Additionally, Henge could be used as an indicator that the cluster is oversubscribed and that additional cluster resources are needed, e.g., when there are many SLO violations, or cluster utility rarely approaches max utility.

Setting Priorities: A consequence of priorities is that Henge will prefer higher-priority jobs at the expense of penalizing lower-priority jobs, especially when the cluster is congested. We recommend that any jobs that wish to absolutely avoid starvation come associated with medium to high priorities. This is feasible as we envision Henge being used internally in companies, where job priorities are set either consensually or by upper management.

Dynamic Priorities: While we have assumed for simplicity that a job's priority is fixed at its submission time (Section 4), Henge would still work if priorities were changed on the fly. Henge's state


machine and utility function approach are compatible with dynamicpriorities.

6 JUICE: DEFINITION AND ALGORITHM
As stated in Section 1, we wish to design a metric for throughput SLOs that is independent of input rate. Henge uses a new metric called juice. Juice defines the fraction of input data that is processed by the topology per unit time. It lies in the interval [0, 1]. A value of 1.0 means all the input data that arrived in the last time unit has been processed. Thus, users can set throughput requirements as percentages of the input rate (Section 4). Henge then attempts to maintain this even as input rates change.

Any algorithm that calculates juice should be:
1. Code Independent: It should be independent of the operators' code, and should be calculable by considering only the number of tuples generated by operators.
2. Rate Independent: It should be input-rate independent.
3. Topology Independent: It should be independent of the shape and structure of the topology. It should be correct in spite of duplication, merging, and splitting of tuples.

Juice Intuition: Juice is formulated to reflect the global processing efficiency of a topology. An operator's contribution to juice is the proportion of the input passed in originally from the source (i.e., from all spouts) that it processed in a given time window. This is the impact of that operator and its upstream operators on this input. The juice of a topology is then the normalized sum of the juice values of all its sinks.

Juice Calculation: Henge calculates juice in configurable windows of time (unit time). Source input tuples are those that arrive at a spout in unit time. For each operator o in a topology that has n parents, we define \(T_o^i\) as the number of tuples sent out from its i-th parent per time unit, and \(E_o^i\) as the number of tuples that operator o executed (per time unit), from those received from parent i.

The per-operator contribution to juice, \(J_o^s\), is the proportion of source input sent from spout s that operator o received and processed. Given that \(J_i^s\) is the juice of o's i-th parent, \(J_o^s\) is:

\[
J_o^s = \sum_{i=1}^{n} \left( J_i^s \times \frac{E_o^i}{T_o^i} \right) \qquad (5)
\]

A spout s has no parents, and its juice is \(J_s = \frac{E_s}{T_s} = 1.0\).

In Eq. 5, the fraction \(\frac{E_o^i}{T_o^i}\) reflects the proportion of tuples an operator received from its parents and processed successfully. If no tuples are waiting in queues, this fraction is equal to 1.0. By multiplying this value with the parent's juice, we accumulate through the topology the effect of all upstream operators.

We make two important observations. In the term \(\frac{E_o^i}{T_o^i}\), it is critical to take the denominator as the number of tuples sent by a parent rather than received at the operator. This allows juice: a) to account for data splitting at the parent (fork in the DAG), and b) to be reduced by tuples dropped by the network. The numerator is the number of processed tuples rather than the number of output tuples; this allows juice to generalize to operator types whose processing may drop tuples (e.g., filter).

Given all operator juice values, a topology's juice can be calculated by normalizing w.r.t. the number of spouts:

\[
\frac{\sum_{\text{Sinks } s_i,\, \text{Spouts } s_j} J_{s_i}^{s_j}}{\text{Total Number of Spouts}} \qquad (6)
\]

If no tuples are lost in the system, the numerator equals the number of spouts. To ensure that juice stays below 1.0, we normalize the sum by the number of spouts.

Example 1: Consider Fig. 2 in Section 3. \(J_A^s = 1 \times \frac{10K}{10K} = 1\) and \(J_B^s = J_A^s \times \frac{8K}{16K} = 0.5\). B has a \(T_B^A\) of 16K and not 8K, since B only receives half the tuples that were sent out by operator A. The value \(J_B^s = 0.5\) indicates that B processed only half the tuples sent out by parent A. This occurred because the parent's output was split among its children. (If, alternately, B and C were sinks and D were absent from the topology, then their juice values would sum up to the topology's juice.) D has two parents: B and C. C is only able to process 6K tuples as it is congested. Thus, \(J_C^s = J_A^s \times \frac{6K}{16K} = 0.375\), and \(T_D^C\) thus becomes 6K. Hence, C's contribution to D's juice is \(0.375 \times \frac{6K}{6K} = 0.375\), while B's contribution is simply \(0.5 \times \frac{8K}{8K} = 0.5\). We sum the two and obtain \(J_D^s = 0.375 + 0.5 = 0.875\). It is less than 1.0 as C was unable to process all tuples due to congestion.

Example 2 (Topology Juice with Duplication, Split, and Merge): Due to space, we refer the reader to our tech report [46] for this example.

Observations: First, while our description used unit time, our implementation calculates juice using a sliding window of 1 minute, collecting data in sub-windows of 10 s. This needs only loose time synchronization across nodes. Second, collecting data in sub-windows implies that when short-lived bursts occur, Henge does not over-react, and instead adjusts resources in a conservative manner. Concretely, it reacts only if the burst is long-lived and stretches across multiple sub-windows. Third, Eq. 6 treats all processed tuples equally; a weighted sum could be used instead to capture differing priorities among sinks. Fourth, processing guarantees (exactly once, at least once, at most once) are orthogonal to juice. Our experiments use non-acked Storm (at-most-once semantics), but Henge also works with acked Storm (at-least-once semantics).

7 IMPLEMENTATION
We integrated Henge into Apache Storm [4]. Henge comprises 3800 lines of Java code, and is an implementation of the predefined IScheduler interface. The scheduler runs in the Storm Nimbus daemon, and is invoked by Nimbus every 10 seconds. We updated Storm Config to let users set SLOs and utility functions for their topologies.

Fig. 6 shows Henge’s architecture. The Decision Maker imple-ments the Henge state machine (Section 5). The Statistics Modulecontinuously calculates cluster and per-topology metrics e.g., thenumber of tuples processed per task of an operator per topology, end-to-end tuple latencies, and the CPU load per node. This informationis used to produce metrics such as juice and utility, which are passedto the Decision Maker. The Decision Maker runs the state machine,and sends commands to Nimbus to implement actions. The StatisticsModule also tracks past states so that reversion can be performed.


Fig. 6: Henge Implementation in Storm: the Statistics Module feeds the Decision Maker, which hands a new schedule to Nimbus; worker processes (with executors) run under Supervisors.

8 EVALUATION
We evaluate Henge with several workloads, topologies, and SLOs. Further results are in our tech report [46].

Experimental Setup: By default, our experiments used the Emulab cluster [82], with machines (2.4 GHz, 12 GB RAM) running Ubuntu 12.04 LTS, over a 1 Gbps network. A machine runs Zookeeper [5] and Nimbus. Workers (Java processes running executors) are allotted across 10 machines (we also evaluate scalability).

Fig. 7: PageLoad Topology from Yahoo!: a linear topology whose operators (between spout and sink) include filter, join with database, transform, and aggregate stages.

Fig. 8: Three Microbenchmark Storm Topologies: Linear (a spout feeding a chain of bolts into a sink), Diamond (a spout fanning out to parallel bolts that merge into a sink), and Star (multiple spouts feeding a single bolt that feeds multiple sinks).

Topologies: We use a combination of production and microbenchmark topologies. Production topologies include the "PageLoad" topology from Yahoo! Inc. [87] (shown in Fig. 7), and WordCount. Its operators are the most commonly used in production: filtering, transformation, aggregation. For some evaluations, we also use microbenchmark topologies (shown in Fig. 8) that are possible sub-parts of larger topologies [87]; these are used in Section 8.4.1, Fig. 16, and Section 8.5.

These topologies are representative of today's workloads. All topologies for modern stream processing engines that we have encountered in production, and in the literature [6, 77], are linear (Fig. 7) or at best linear-like (like Fig. 8). While Henge is built to handle

arbitrarily complex topologies (as long as they are DAGs), for practicality, we use the topologies just discussed.

The latency SLO thresholds in our experiments are on the order of milliseconds. Thus, we subject Henge to more constraints and stress than the values listed in Table 1.

In each experimental run, we initially let topologies run for 900 s without interference (to stabilize and to observe their performance with vanilla Storm), and then enable Henge to take actions. All topology SLOs use a knee utility function (Section 4). A knee function is parameterized by the SLO threshold and the maximum utility value. Hence, below we use "SLO" as a shorthand for the SLO threshold, and specify the max utility value separately.

Storm Configuration: In experiments where we compare Henge with Storm, we used Storm's default scheduler and began the Storm run with a configuration identical to the Henge run (e.g., cluster size, number of executors).

8.1 Henge Policy and Scheduling

8.1.1 Meeting SLOs.

Fig. 9: Maximizing Cluster Utility: Red (dotted) line is total system utility. Blue (solid) line is magnified slope of the red line. Vertical lines are reconfigurations, annotated by the job touched. Henge reconfigures higher max-utility jobs first, leading to a faster increase in system utility.

Maximizing Cluster Utility: To maximize total cluster utility, Henge greedily prefers to reconfigure first those jobs that have a higher max achievable utility (among those missing their SLOs). In Fig. 9, we run 9 PageLoad topologies on a cluster, with max utility values ranging from 10 to 90 in steps of 10. The SLO threshold for all jobs is 60 ms. Henge first picks T9 (highest max utility of 90), leading to a sharp increase in total cluster utility at 950 s. Then, it continues in this greedy way. We observe some latent peaks when jobs reconfigured in the past stabilize to their max utility. For instance, at 1425 s we observe a sharp increase in the slope (solid) line as T4 (reconfigured at 1331 s) reaches its SLO threshold. All jobs meet their SLO within 15 minutes (900 s to 1800 s).

Hybrid SLOs: We evaluate a topology with a hybrid SLO that has separate thresholds for latency and juice, and two corresponding utility functions (Section 4) with identical max utility values. The job's utility is then the average of these two utilities.

Fig. 10 shows 10 (identical) PageLoad topologies with hybrid SLOs running on a cluster of 10 machines. Each topology has SLO thresholds of: juice 1.0, and latency 70 ms. The max utility value of each topology is 35. Henge takes about 13 minutes (t=920 s to t=1710 s) to successfully reconfigure all topologies to meet their


Fig. 10: Hybrid SLOs: Henge Reconfiguration.

SLOs. 9 out of 10 topologies required a single reconfiguration, and one (T9) required 3 reconfigurations.

Is convergence fast enough? Stream processing jobs run for long periods, so in practice Henge's convergence times are relatively small. We ran 10 topologies over 48 hours with a diurnal workload (details in the next section). The total time taken by Henge's actions, summed across all topologies, was 2.4 hours. This is only 0.5% of the total runtime summed across all topologies. The longest convergence time (for any topology) was 16 minutes, which is only 0.56% of the total runtime.

8.1.2 Handling Dynamic Workloads.

A. Spikes in Workload: Fig. 11 shows the effect of a workload spike in Henge. Two different PageLoad topologies, A and B, are subjected to input spikes. B's workload spikes by 2×, starting from 3600 s. The spike lasts until 7200 s, when A's spike (also 2×) begins. Each topology's SLO is 80 ms, with a max utility of 35. Fig. 11 shows that: i) output rates keep up for both topologies both during and after the spikes, and ii) utility stays maxed-out during the spikes. In effect, Henge hides the effect of the input rate spike from the user.

Fig. 11: Spikes in Workload: Left y-axis shows total cluster utility (max possible is 35 × 2 = 70). Right y-axis shows the variation in workload as time progresses.

B. Diurnal Workloads: Diurnal workloads are common for stream processing, e.g., in e-commerce websites [30] and social media [61]. We generated a diurnal workload based on the distribution of the SDSC-HTTP [12] and EPA-HTTP traces [9], injecting them into PageLoad topologies. 5 jobs run with the SDSC-HTTP trace and, concurrently, 5 other jobs run with the EPA-HTTP trace. All jobs have max-utility = 35, and a latency SLO of 60 ms.

Fig. 12: Diurnal Workloads: a) Input and output rates vs. time, for the two diurnal workloads. b) Utility of a job (reconfigured by Henge) with the EPA workload. c) CDF of SLO satisfaction for Henge, default Storm, and manual configuration. Henge adapts during the first cycle, and fewer reconfigurations are needed later.

Fig. 12 shows the result of running 48 hours of the trace (each hour is mapped to 10 mins). In Fig. 12a, workloads increase from hour 7 of day 1, reach their peak by hour 13⅓, and then fall. Henge reconfigures all 10 jobs, reaching 89% of max cluster utility by hour 15.

Fig. 12b shows a topology running the EPA workload (other topologies exhibited similar behavior). Observe how Henge reconfigurations from hour 8 to 16 adapt to the fast-changing workload. This results in fewer SLO violations during the second peak (hours 32 to 40). Thus, Henge tackles diurnal workloads without extra resources.

Fig. 12c shows the CDF of SLO satisfaction for three systems. Default Storm gives 0.0006% SLO satisfaction at the median, and 30.9% at the 90th percentile (meaning that 90% of the time, default Storm provided at most 30.9% of the cluster's max achievable utility). Henge yields 74.9%, 99.1%, and 100% SLO satisfaction at the 15th, 50th, and 90th percentiles, respectively.

Henge is preferable over manual configurations. We manually configured all topologies to meet their SLOs at median load. They provide 66.5%, 99.8%, and 100% SLO satisfaction at the 15th, 50th, and 90th percentiles, respectively. Henge is better than manual configuration from the 15th to the 45th percentile, and comparable thereafter.

Henge has an average SLO satisfaction rate of 88.12%, while default Storm and manually configured topologies provide averages of 4.56% and 87.77%, respectively. Thus, Henge gives 19.3× better


SLO satisfaction than default Storm and does better than manualconfiguration.

8.2 Production Workloads

We configured the sizes of 5 PageLoad topologies based on a Yahoo! Storm production cluster and Twitter datasets [20], shown in Table 3. We use 20 nodes with 14 worker processes. For each topology, we use an input rate proportional to its number of workers. T1-T4 run sentiment analysis on Twitter workloads [20]. T5 processes logs at a constant rate. Each topology has a latency SLO threshold of 60 ms and a max utility of 35.

Job   Workload                      Workers   Tasks
T1    Analysis (Egypt Unrest)       234       1729
T2    Analysis (London Riots)       31        459
T3    Analysis (Tsunami in Japan)   8         100
T4    Analysis (Hurricane Irene)    2         34
T5    Processing Topology           1         18

Table 3: Job and Workload Distributions in Experiments: Derived from Yahoo! production clusters, using Twitter datasets for T1-T4. (Results in Figure 13.)

This is an extremely constrained cluster where not all SLOs can be met. Yet, Henge improves cluster utility. Fig. 13a shows the CDF of the fraction of time each topology provided a given utility (including the initial 900 s where Henge is held back). T5 shows the most improvement (at the 5th percentile, it has 100% SLO satisfaction), whereas T4 shows the worst performance (at the median, its utility is 24, which is 68.57% of 35). The median SLO satisfaction for T1-T3 ranges from 27.0 to 32.3 (77.3% and 92.2% of max utility, respectively).

Reversion: Fig. 13b depicts Henge's reversion. At 31710 s, system utility drops due to natural system fluctuations. This forces Henge to reconfigure two topologies (T1, T4). Since system utility continues to drop, Henge is forced to reduce a topology (T5), which satisfies its SLO both before and after reduction. As utility improves at 32042 s, Henge proceeds to reconfigure other topologies. However, the last reconfiguration causes another drop in utility (at 32150 s). Henge reverts to the configuration that had the highest utility (at 32090 s). After this point, total cluster utility stabilizes at 120 (68.6% of max utility). Thus, even in scenarios where Henge is unable to reach the max system utility, it behaves gracefully, does not thrash, and converges quickly.
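A sketch of the reversion behavior just described, assuming the scheduler snapshots each job's configuration and the resulting cluster-wide utility after every action; the class and method names are illustrative, not Henge's actual interfaces:

```python
class ReversionTracker:
    """Remember the configuration with the best cluster-wide utility
    seen so far; if the latest action lowered utility instead of
    raising it, roll all jobs back to that best configuration."""

    def __init__(self):
        self.best_utility = float("-inf")
        self.best_config = None

    def record(self, config, total_utility):
        # Snapshot after every action (reconfiguration or reduction).
        if total_utility > self.best_utility:
            self.best_utility = total_utility
            self.best_config = config

    def maybe_revert(self, current_utility, apply_config):
        # Revert only when the latest action made things worse,
        # e.g., the drop at 32150 s triggering the rollback in Fig. 13b.
        if self.best_config is not None and current_utility < self.best_utility:
            apply_config(self.best_config)
            return True
        return False
```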

8.2.1 Reacting to Natural Fluctuations.

Natural fluctuations can occur in a cluster due to load variations that arise from interfering processes, disk I/O, page swaps, etc. Fig. 14 shows such a scenario. We run 8 PageLoad topologies, 7 of which have an SLO of 70 ms; the 8th's SLO is 10 ms. Henge resolves congestion initially and stabilizes the cluster by 1000 s. At 21800 s, CPU load increases sharply due to OS behaviors (beyond Henge's control). Due to the significant drop in cluster utility, Henge reduces two SLO-achieving topologies. The reduction lets other topologies recover within 20 minutes (by 23000 s). Henge converges the cluster to the same total utility as before the CPU spike.
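As an illustration of the reduction decision under such load spikes, a sketch that frees resources only from jobs currently meeting their SLO; the job fields, the CPU threshold, and the choice of two victims are illustrative assumptions based on Fig. 14, not Henge's exact policy:

```python
def pick_reduction_victims(jobs, cluster_cpu_load, cpu_threshold=0.9):
    """When cluster CPU load spikes and utility drops, shrink jobs that
    are meeting their SLO (they have headroom to give up resources),
    leaving SLO-missing jobs untouched so they can recover.
    Each job is a dict with 'utility', 'max_utility', and 'headroom'
    (how far its latency sits below its SLO threshold); these fields
    are illustrative."""
    if cluster_cpu_load < cpu_threshold:
        return []
    meeting_slo = [j for j in jobs if j["utility"] >= j["max_utility"]]
    # Prefer jobs with the most latency headroom below their SLO.
    meeting_slo.sort(key=lambda j: j["headroom"], reverse=True)
    return meeting_slo[:2]   # two reductions sufficed in Fig. 14
```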

(a) CDF of the fraction of total time that tenant topologies achieve a given SLO. Max utility for each topology is 35.


(b) Reconfiguration at 32111 s causes a drop in total system utility. Henge reverts the configuration of all tenants to that of 32042 s. Vertical lines show Henge actions for the given jobs.

Fig. 13: Henge on Production Workloads.


Fig. 14: Handling CPU Load Spikes: a) Total cluster utility. b) Average CPU load on machines during the spike interval.

8.3 Stateful Topologies

Henge handles stateful topologies gracefully. We ran four WordCount topologies with identical workload and configuration to T2 in Table 3. Two are stateful: they periodically checkpoint state to Redis and have 240 ms latency SLOs. The other two topologies do not persist state in an external store and have lower SLOs of 60 ms. Initially, none of them meet their SLOs. Table 4 shows results after convergence.


Job Type    Avg. Reconfig. Rounds (Stdev)   Avg. Convergence Time (Stdev)
Stateful    5.5 (0.6)                       1358.7 s (58.1 s)
Stateless   4 (0.8)                         1134.2 s (210.5 s)

Table 4: Stateful Topologies: Convergence rounds and times for a cluster with stateful and stateless topologies.

Stateful topologies take 1.5 extra reconfigurations to converge to their SLO, and 19.8% more reconfiguration time. This is due to checkpointing and recovery mechanisms, which are orthogonal to Henge.
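As a hedged illustration of the external checkpointing these stateful topologies perform, the following periodically snapshots operator state to Redis using redis-py; the key layout, the 30 s interval, and the function names are our assumptions, not the paper's implementation:

```python
import json
import time

import redis  # assumes redis-py and a reachable Redis server

r = redis.Redis(host="localhost", port=6379)

def checkpoint_loop(get_state, job_id, interval_s=30):
    """Periodically persist operator state (e.g., a word -> count dict)
    so a restarted task can resume from the last snapshot instead of
    recomputing from scratch; this is the overhead that makes stateful
    topologies converge more slowly in Table 4."""
    while True:
        r.set(f"ckpt:{job_id}", json.dumps(get_state()))
        time.sleep(interval_s)

def restore(job_id):
    """Load the most recent snapshot on recovery, if one exists."""
    raw = r.get(f"ckpt:{job_id}")
    return json.loads(raw) if raw else {}
```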

8.4 Scalability and Fault-tolerance

8.4.1 Scalability.

Increasing the Number of Topologies: Fig. 15 stresses Henge by overloading the cluster with topologies over time. We start with 5 PageLoad jobs (latency SLO = 70 ms, max utility = 35), and add 20 more jobs at each even hour (t = 2, 4, 6, 8 hours).


Fig. 15: Scalability w.r.t. Number of Topologies: The cluster starts with 5 tenants; 20 tenants are added every 2 hours until the 8-hour mark. The green dotted line is average job utility, the blue solid line is the number of jobs on the cluster, and vertical black lines are reconfigurations.

Henge stabilizes better when there are more topologies and a larger state space. With fewer topologies (the first 2 hours) there is less state space to maneuver in, and hence the average utility stays low. The 20 new tenant topologies at t=2 hours drop average utility but also open up the state space; Henge quickly converges the cluster to the max utility value. We observe the same behavior when more jobs arrive at t=4, 6, and 8 hours.

Henge also tends to do fewer reconfigurations on older topologies than on new ones; see our tech report [46].

Increasing Cluster Size: In Fig. 16, we run 40 topologies on clusters of 10 to 40 machines (in our production use case studies, we very rarely saw cluster sizes exceeding 40 nodes). Each machine has 8 cores, 64 GB of RAM, and a 10 Gbps interconnect. The topologies (with SLO, max utility) are: 20 PageLoad (80 ms, 35), 8 Diamond (juice=1.0, 5), 6 Star (1.0, 5), and 6 Linear (1.0, 5).

As cluster size increases from an overloaded 10 nodes to 20 nodes, time to converge drops by 50% and then plateaus. At 10 nodes, only 40% of jobs meet their SLO thresholds, and at the 5th percentile, jobs achieve only 0.3% of their max utility. At 20, 30, and 40 nodes, 5th-percentile SLO satisfaction is 56.4%, 74.0%, and 94.5% respectively.

Fig. 16: Scalability w.r.t. Number of Machines: 40 jobs run on cluster sizes increasing from 10 to 40 nodes. The bars show the number of reconfigurations until convergence.

We point the reader to our tech report [46] for details. Overall, Henge's performance generally improves with cluster size, and its overheads scale independently.

8.4.2 Failure Recovery.


Fig. 17: Fault-tolerance: Failure at t=1020 s and recovery at t=1380 s. Henge makes no wrong decisions due to the failure, and immediately converges to max system utility after recovery.

Henge reacts gracefully to failures. In Fig. 17, we run 9 PageLoad topologies, each with a 70 ms SLO and 35 max utility. We introduce a failure at the worst possible time: during Henge's reconfiguration operations, at 1020 s. This severs communication between Henge and all worker nodes, so Henge's Statistics module is unable to obtain fresh job information. Henge reacts conservatively by avoiding reconfiguration in the absence of data. At 1380 s, when communication is restored, Henge collects performance data for 5 minutes (until 1680 s) and then proceeds with reconfigurations until it meets all SLOs.
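A sketch of this conservative guard, assuming each statistics report carries a timestamp; the staleness threshold is our assumption, while the 5-minute warm-up mirrors the experiment:

```python
import time

STALE_AFTER_S = 60   # assumed: no report for this long means stale data
WARMUP_S = 300       # collect 5 minutes of fresh data after reconnecting

class FailureAwareScheduler:
    """Take no reconfiguration action on stale data; after the link to
    the workers returns, wait out a warm-up period before acting."""

    def __init__(self):
        self.last_report = float("-inf")
        self.reconnected_at = None

    def on_stats(self, report_time):
        if report_time - self.last_report > STALE_AFTER_S:
            self.reconnected_at = report_time   # communication restored
        self.last_report = report_time

    def may_reconfigure(self):
        now = time.time()
        if now - self.last_report > STALE_AFTER_S:
            return False   # no fresh data: avoid any action at all
        if self.reconnected_at and now - self.reconnected_at < WARMUP_S:
            return False   # still rebuilding a trustworthy picture
        return True
```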

8.5 Memory Utilization


Fig. 18: Memory Utilization: 8 jobs with joins and 30 s tuple retention.

Fig. 18 shows a cluster with 8 memory-intensive micro-benchmark topologies (latency SLO=100 ms, max utility=50). These topologies contain joins in which tuples are retained for 30 s, creating memory pressure at some nodes. The figure shows that Henge reconfigures quickly to reach the total max utility of 400 by 2444 s, keeping average memory usage below 36%. Critically, memory utilization (blue dotted line) plateaus in the converged state, showing that Henge handles memory-bound topologies gracefully.
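A minimal sketch of the kind of windowed join these micro-benchmarks contain; the retained tuples are what create the memory pressure. The 30 s retention mirrors the experiment description, while the class shape and single-sided buffering are our simplifications:

```python
import time
from collections import deque

class WindowJoin:
    """Retain left-stream tuples for 30 s and join arriving right-stream
    tuples against them by key; the retained buffer is the source of
    memory pressure in the Fig. 18 micro-benchmark."""

    def __init__(self, retention_s=30):
        self.retention_s = retention_s
        self.left = deque()   # entries: (arrival_time, key, value)

    def insert_left(self, key, value):
        self._expire()
        self.left.append((time.time(), key, value))

    def probe_right(self, key):
        self._expire()
        return [v for (_, k, v) in self.left if k == key]

    def _expire(self):
        # Drop tuples older than the retention window.
        cutoff = time.time() - self.retention_s
        while self.left and self.left[0][0] < cutoff:
            self.left.popleft()
```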

9 RELATED WORK

Stream Processing: Among classical stream processing systems, Aurora [17] performs OS-level scheduling of threads and drops tuples in order to provide QoS for multiple queries [25]. However, Aurora's techniques do not apply or extend to a cluster environment. Borealis [18] is a distributed stream processing system that supports QoS, but it is not multi-tenant. It orchestrates inflowing data and optimizes coverage to reduce tuple shedding. Borealis, as implemented, does not tackle the hard problems associated with multi-tenancy.

Modern stream processing systems [1–4, 11, 19, 54, 60] do not natively handle adaptive elasticity. Ongoing work [8] on Spark Streaming [88] allows scaling but does not apply to resource-limited multi-tenant clusters. [26, 68] scale out stateful operators and checkpoint them, but do not scale in or support multi-tenancy.

Resource-aware elasticity in stream processing [23, 27, 34, 47, 50, 66, 71] assumes infinite resources that a tenant can scale out to. [21, 36, 43, 55, 56, 72, 84] propose resource provisioning but not multi-tenancy. Other works have focused on balancing load [62, 63, 75], optimal operator placement [38, 49, 67], and scale-out strategies [39, 40] in stream processing; these approaches can be used to complement Henge. [38–40] look at single-job elasticity, but not multi-tenancy.

Themis [48] and others [76, 91] reduce load in stream processing systems by dropping tuples; Henge does not drop tuples. Themis uses the SIC metric to evaluate how much each tuple contributes, in terms of accuracy, to the result. Henge's juice metric is different: it is used to calculate the processing rate, and it serves as an input-rate-independent throughput metric.

Dhalion [33] supports throughput SLOs for Heron, but does not generalize to varying input rates. It uses backpressure as a trigger for scaling out topologies, but as backpressure takes time to propagate (e.g., after spikes), it is less responsive than the CPU load metric Henge uses.

Multi-tenant Resource Management Systems: Resource schedulers like YARN [80] and Mesos [41] can be run under stream processing systems and manually tuned [28]. As job internals are not exposed to the scheduler, it is hard to make fine-grained decisions for stream processing jobs in an automated fashion.

Cluster Scheduling: Some scheduling work addresses resource fairness and SLO achievement [31, 32, 37, 58, 69, 73]. VM-based scaling approaches [51] do not map directly and efficiently to expressive frameworks like stream processing systems. Among multi-tenant stream processing systems, Chronostream [85] achieves elasticity through migration across nodes, but it does not support SLOs.

SLAs/SLOs in Other Areas: SLAs/SLOs have been explored in other areas. Pileus [78] is a geo-distributed storage system that supports multi-level SLA requirements dealing with latency and consistency. Tuba [22] builds on Pileus and uses reconfiguration to adapt to changing workloads. SPANStore [86] is a geo-replicated storage service that automates trading off cost vs. latency while remaining consistent and fault-tolerant. E-store [74] redistributes hot and cold data chunks across cluster nodes if load exceeds a threshold. Cake [81] supports latency and throughput SLOs in multi-tenant storage settings.

10 CONCLUSION

We presented Henge, a system for intent-driven, SLO-based multi-tenant stream processing. Henge provides SLO satisfaction for jobs with latency and/or throughput SLOs. To make throughput SLOs independent of input rate and topology structure, Henge uses a new metric called juice. When jobs miss their SLO, Henge uses three kinds of actions (reconfiguration, reversion, or reduction) to improve the total utility achieved cluster-wide. Evaluation with Yahoo! topologies and Twitter datasets showed that, in multi-tenant settings with a mix of SLOs, Henge: i) converges quickly to max system utility when resources suffice; ii) converges quickly to a high system utility on a constrained cluster; iii) gracefully handles dynamic workloads; iv) scales with increasing cluster size and number of jobs; and v) recovers from failures.

ACKNOWLEDGEMENTS

This work was supported in part by: NSF CNS 1409416, NSF CNS 1319527, AFOSR/AFRL FA8750-11-2-0084, and a generous gift from Microsoft. We would like to thank Robert Evans from the Yahoo! Storm team for workload traces and his invaluable feedback. We thank Boyang Jerry Peng and Reza Farivar for feedback and discussions on initial versions of our system. We also thank Umar Kalim, Mainak Ghosh, and Sangeetha Abdu Jyothi for their invaluable input. We thank the University of Utah for Emulab support.

REFERENCES

[1] Apache Flink. http://flink.apache.org/. Last Visited: Thursday 6th September, 2018.
[2] Apache Flume. https://flume.apache.org/. Last Visited: Thursday 6th September, 2018.
[3] Apache Samza. http://samza.apache.org/. Last Visited: 03/2016.
[4] Apache Storm. http://storm.apache.org/. Last Visited: Thursday 6th September, 2018.
[5] Apache Zookeeper. http://zookeeper.apache.org/. Last Visited: Thursday 6th September, 2018.
[6] Based on private conversations with the Yahoo! Storm team.
[7] CPU Load. http://www.linuxjournal.com/article/9001. Last Visited: Thursday 6th September, 2018.
[8] Elasticity in Spark Core. http://www.ibmbigdatahub.com/blog/explore-true-elasticity-spark/. Last Visited: Thursday 6th September, 2018.
[9] EPA-HTTP Trace. http://ita.ee.lbl.gov/html/contrib/EPA-HTTP.html. Last Visited: Thursday 6th September, 2018.
[10] How to collect and analyze data from 100,000 weather stations. https://www.cio.com/article/2936592/big-data/how-to-collect-and-analyze-data-from-100000-weather-stations.html. Last Visited: Thursday 6th September, 2018.
[11] S4. http://incubator.apache.org/s4/. Last Visited: 03/2016.
[12] SDSC-HTTP Trace. http://ita.ee.lbl.gov/html/contrib/SDSC-HTTP.html. Last Visited: Thursday 6th September, 2018.
[13] SLOs. https://en.wikipedia.org/wiki/Service_level_objective. Last Visited: Thursday 6th September, 2018.
[14] Storm 0.8.2 Release Notes. http://storm.apache.org/2013/01/11/storm082-released.html/. Last Visited: Thursday 6th September, 2018.
[15] Storm Applications. http://storm.apache.org/Powered-By.html. Last Visited: Thursday 6th September, 2018.
[16] Uber Releases Hourly Ride Numbers In New York City To Fight De Blasio. https://techcrunch.com/2015/07/22/uber-releases-hourly-ride-numbers-in-new-york-city-to-fight-de-blasio/. Last Visited: Thursday 6th September, 2018.
[17] Abadi, D., Carney, D., Cetintemel, U., Cherniack, M., Convey, C., Erwin, C., Galvez, E., Hatoun, M., Maskey, A., Rasin, A., et al. Aurora: A Data Stream Management System. In Proceedings of the International Conference on Management of Data (SIGMOD) (2003), ACM, pp. 666–666.
[18] Abadi, D. J., Ahmad, Y., Balazinska, M., Cetintemel, U., Cherniack, M., Hwang, J.-H., Lindner, W., Maskey, A., Rasin, A., Ryvkina, E., et al. The Design of the Borealis Stream Processing Engine. In Proceedings of the Conference on Innovative Data Systems Research (2005), vol. 5, pp. 277–289.
[19] Akidau, T., Balikov, A., Bekiroglu, K., Chernyak, S., Haberman, J., Lax, R., McVeety, S., Mills, D., Nordstrom, P., and Whittle, S. MillWheel: Fault-Tolerant Stream Processing at Internet Scale. In Proceedings of the VLDB Endowment (2013), vol. 6, VLDB Endowment, pp. 1033–1044.
[20] Amin, T. Apollo Social Sensing Toolkit. http://apollo3.cs.illinois.edu/datasets.html, 2014. Last Visited: Thursday 6th September, 2018.
[21] Aniello, L., Baldoni, R., and Querzoni, L. Adaptive Online Scheduling in Storm. In Proceedings of the 7th ACM International Conference on Distributed Event-Based Systems (2013), ACM, pp. 207–218.
[22] Ardekani, M. S., and Terry, D. B. A Self-Configurable Geo-Replicated Cloud Storage System. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI) (2014), pp. 367–381.
[23] Balkesen, C., Tatbul, N., and Özsu, M. T. Adaptive Input Admission and Management for Parallel Stream Processing. In Proceedings of the 7th ACM International Conference on Distributed Event-Based Systems (2013), ACM, pp. 15–26.
[24] Bilal, M., and Canini, M. Towards Automatic Parameter Tuning of Stream Processing Systems. In Proceedings of the Symposium on Cloud Computing (2017), ACM, p. 1.
[25] Carney, D., Çetintemel, U., Rasin, A., Zdonik, S., Cherniack, M., and Stonebraker, M. Operator Scheduling in a Data Stream Manager. In Proceedings of the 2003 VLDB Conference (2003), Elsevier, pp. 838–849.
[26] Castro Fernandez, R., Migliavacca, M., Kalyvianaki, E., and Pietzuch, P. Integrating Scale-Out and Fault Tolerance in Stream Processing using Operator State Management. In Proceedings of the International Conference on Management of Data (SIGMOD) (2013), ACM, pp. 725–736.
[27] Cervino, J., Kalyvianaki, E., Salvachua, J., and Pietzuch, P. Adaptive Provisioning of Stream Processing Systems in the Cloud. In Proceedings of the 28th International Conference on Data Engineering Workshops (2012), IEEE, pp. 295–301.
[28] Cloudera. Tuning YARN — Cloudera. http://www.cloudera.com/documentation/enterprise/5-2-x/topics/cdh_ig_yarn_tuning.html, 2016. Last Visited: Thursday 6th September, 2018.
[29] Curino, C., Difallah, D. E., Douglas, C., Krishnan, S., Ramakrishnan, R., and Rao, S. Reservation-Based Scheduling: If You're Late Don't Blame Us! In Proceedings of the ACM Symposium on Cloud Computing (2014), ACM, pp. 1–14.
[30] DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. Dynamo: Amazon's Highly Available Key-Value Store. ACM SIGOPS Operating Systems Review 41, 6 (2007), 205–220.
[31] Delimitrou, C., and Kozyrakis, C. Paragon: QoS-Aware Scheduling for Heterogeneous Datacenters. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2013), ASPLOS '13, ACM, pp. 77–88.
[32] Delimitrou, C., and Kozyrakis, C. Quasar: Resource-Efficient and QoS-Aware Cluster Management. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2014), ASPLOS '14, ACM, pp. 127–144.
[33] Floratou, A., Agrawal, A., Graham, B., Rao, S., and Ramasamy, K. Dhalion: Self-Regulating Stream Processing in Heron. In Proceedings of the VLDB Endowment (2017), ACM, p. 1.
[34] Fu, T. Z., Ding, J., Ma, R. T., Winslett, M., Yang, Y., and Zhang, Z. DRS: Dynamic Resource Scheduling for Real-Time Analytics over Fast Streams. In Proceedings of the 35th International Conference on Distributed Computing Systems (2015), IEEE, pp. 411–420.
[35] Gedik, B., Andrade, H., Wu, K.-L., Yu, P. S., and Doo, M. SPADE: The System S Declarative Stream Processing Engine. In Proceedings of the International Conference on Management of Data (SIGMOD) (2008), ACM, pp. 1123–1134.
[36] Gedik, B., Schneider, S., Hirzel, M., and Wu, K.-L. Elastic Scaling for Data Stream Processing. IEEE Transactions on Parallel and Distributed Systems 25, 6 (2014), 1447–1463.
[37] Ghodsi, A., Zaharia, M., Hindman, B., Konwinski, A., Shenker, S., and Stoica, I. Dominant Resource Fairness: Fair Allocation of Multiple Resource Types. In Proceedings of the 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI) (2011), vol. 11, pp. 24–24.
[38] Heinze, T., Jerzak, Z., Hackenbroich, G., and Fetzer, C. Latency-Aware Elastic Scaling for Distributed Data Stream Processing Systems. In Proceedings of the 8th ACM International Conference on Distributed Event-Based Systems (2014), ACM, pp. 13–22.
[39] Heinze, T., Pappalardo, V., Jerzak, Z., and Fetzer, C. Auto-Scaling Techniques for Elastic Data Stream Processing. In Proceedings of the 30th International Conference on Data Engineering Workshops (2014), IEEE, pp. 296–302.
[40] Heinze, T., Roediger, L., Meister, A., Ji, Y., Jerzak, Z., and Fetzer, C. Online Parameter Optimization for Elastic Data Stream Processing. In Proceedings of the 6th ACM Symposium on Cloud Computing (2015), ACM, pp. 276–287.
[41] Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., Joseph, A. D., Katz, R. H., Shenker, S., and Stoica, I. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. In Proceedings of the 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI) (2011), vol. 11, pp. 22–22.
[42] Jacobson, V. Congestion Avoidance and Control. In Proceedings of the ACM SIGCOMM Computer Communication Review (1988), vol. 18, ACM, pp. 314–329.
[43] Jain, N., Amini, L., Andrade, H., King, R., Park, Y., Selo, P., and Venkatramani, C. Design, Implementation, and Evaluation of the Linear Road Benchmark on the Stream Processing Core. In Proceedings of the International Conference on Management of Data (SIGMOD) (2006), ACM, pp. 431–442.
[44] Jones, C., Wilkes, J., Murphy, N., and Smith, C. Service Level Objectives. https://landing.google.com/sre/book/chapters/service-level-objectives.html. Last Visited: Thursday 6th September, 2018.
[45] Jyothi, S. A., Curino, C., Menache, I., Narayanamurthy, S. M., Tumanov, A., Yaniv, J., Goiri, Í., Krishnan, S., Kulkarni, J., and Rao, S. Morpheus: Towards Automated SLOs for Enterprise Clusters. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (2016), p. 117.
[46] Kalim, F., Xu, L., Bathey, S., Meherwal, R., and Gupta, I. Henge: Intent-driven Multi-Tenant Stream Processing. https://arxiv.org/abs/1802.00082. Last Visited: Thursday 6th September, 2018.
[47] Kalyvianaki, E., Charalambous, T., Fiscato, M., and Pietzuch, P. Overload Management in Data Stream Processing Systems with Latency Guarantees. In Proceedings of the 7th IEEE International Workshop on Feedback Computing (2012).
[48] Kalyvianaki, E., Fiscato, M., Salonidis, T., and Pietzuch, P. Themis: Fairness in Federated Stream Processing under Overload. In Proceedings of the 2016 International Conference on Management of Data (2016), ACM, pp. 541–553.
[49] Kalyvianaki, E., Wiesemann, W., Vu, Q. H., Kuhn, D., and Pietzuch, P. SQPR: Stream Query Planning with Reuse. In Proceedings of the 27th International Conference on Data Engineering (April 2011), pp. 840–851.
[50] Kleiminger, W., Kalyvianaki, E., and Pietzuch, P. Balancing Load in Stream Processing with the Cloud. In Proceedings of the 27th International Conference on Data Engineering Workshops (2011), IEEE, pp. 16–21.
[51] Knauth, T., and Fetzer, C. Scaling Non-Elastic Applications Using Virtual Machines. In Proceedings of the IEEE International Conference on Cloud Computing (July 2011), pp. 468–475.
[52] Kopetz, H. Real-Time Systems: Design Principles for Distributed Embedded Applications. Springer, 2011.
[53] Kreps, J., Narkhede, N., Rao, J., et al. Kafka: A Distributed Messaging System for Log Processing. In Proceedings of the NetDB (2011), pp. 1–7.
[54] Kulkarni, S., Bhagat, N., Fu, M., Kedigehalli, V., Kellogg, C., Mittal, S., Patel, J. M., Ramasamy, K., and Taneja, S. Twitter Heron: Stream Processing at Scale. In Proceedings of the International Conference on Management of Data (SIGMOD) (2015), ACM, pp. 239–250.
[55] Li, B., Diao, Y., and Shenoy, P. Supporting Scalable Analytics with Latency Constraints. In Proceedings of the VLDB Endowment (2015), vol. 8, VLDB Endowment, pp. 1166–1177.
[56] Loesing, S., Hentschel, M., Kraska, T., and Kossmann, D. Stormy: An Elastic and Highly Available Streaming Service in the Cloud. In Proceedings of the Joint EDBT/ICDT Workshops (2012), ACM, pp. 55–60.
[57] Low, S. H., and Lapsley, D. E. Optimization Flow Control. I. Basic Algorithm and Convergence. IEEE/ACM Transactions on Networking 7, 6 (1999), 861–874.
[58] Mace, J., Bodik, P., Fonseca, R., and Musuvathi, M. Retro: Targeted Resource Management in Multi-Tenant Distributed Systems. In Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI) (2015), pp. 589–603.
[59] Markets and Markets. Streaming Analytics Market Worth 13.70 Billion USD by 2021. https://www.marketsandmarkets.com/Market-Reports/streaming-analytics-market-64196229.html. Last Visited: Thursday 6th September, 2018.
[60] Murray, D. G., McSherry, F., Isaacs, R., Isard, M., Barham, P., and Abadi, M. Naiad: A Timely Dataflow System. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (New York, NY, USA, 2013), SOSP '13, ACM, pp. 439–455.
[61] Naaman, M., Zhang, A. X., Brody, S., and Lotan, G. On the Study of Diurnal Urban Routines on Twitter. In Proceedings of the 6th International AAAI Conference on Weblogs and Social Media (2012).
[62] Nasir, M. A. U., Morales, G. D. F., García-Soriano, D., Kourtellis, N., and Serafini, M. The Power of Both Choices: Practical Load Balancing for Distributed Stream Processing Engines. In Proceedings of the 31st International Conference on Data Engineering (ICDE) (April 2015), pp. 137–148.
[63] Nasir, M. A. U., Morales, G. D. F., Kourtellis, N., and Serafini, M. When Two Choices are Not Enough: Balancing at Scale in Distributed Stream Processing. In Proceedings of the 32nd International Conference on Data Engineering (ICDE) (May 2016), IEEE, pp. 589–600.
[64] Noghabi, S. A., Paramasivam, K., Pan, Y., Ramesh, N., Bringhurst, J., Gupta, I., and Campbell, R. H. Samza: Stateful Scalable Stream Processing at LinkedIn. Proceedings of the VLDB Endowment 10, 12 (2017), 1634–1645.
[65] Ousterhout, K., Canel, C., Ratnasamy, S., and Shenker, S. Monotasks: Architecting for Performance Clarity in Data Analytics Frameworks. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP) (2017).
[66] Peng, B., Hosseini, M., Hong, Z., Farivar, R., and Campbell, R. R-Storm: Resource-Aware Scheduling in Storm. In Proceedings of the 16th Annual Middleware Conference (2015), ACM, pp. 149–161.
[67] Pietzuch, P., Ledlie, J., Shneidman, J., Roussopoulos, M., Welsh, M., and Seltzer, M. Network-Aware Operator Placement for Stream-Processing Systems. In Proceedings of the 22nd International Conference on Data Engineering (ICDE '06) (April 2006), pp. 49–49.
[68] Qian, Z., He, Y., Su, C., Wu, Z., Zhu, H., Zhang, T., Zhou, L., Yu, Y., and Zhang, Z. TimeStream: Reliable Stream Computation in the Cloud. In Proceedings of the 8th ACM European Conference on Computer Systems (New York, NY, USA, 2013), EuroSys '13, ACM, pp. 1–14.
[69] Rameshan, N., Liu, Y., Navarro, L., and Vlassov, V. Hubbub-Scale: Towards Reliable Elastic Scaling under Multi-Tenancy. In Proceedings of the 16th International Symposium on Cluster, Cloud and Grid Computing (CCGrid) (2016), IEEE, pp. 233–244.
[70] Ravindran, B., Jensen, E. D., and Li, P. On Recent Advances in Time/Utility Function Real-Time Scheduling and Resource Management. In Proceedings of the 8th IEEE International Symposium on Object-Oriented Real-Time Distributed Computing (ISORC 2005) (2005), IEEE, pp. 55–60.
[71] Satzger, B., Hummer, W., Leitner, P., and Dustdar, S. Esc: Towards an Elastic Stream Computing Platform for the Cloud. In Proceedings of the 4th International Conference on Cloud Computing (2011), IEEE, pp. 348–355.
[72] Schneider, S., Andrade, H., Gedik, B., Biem, A., and Wu, K.-L. Elastic Scaling of Data Parallel Operators in Stream Processing. In Proceedings of the International Parallel and Distributed Processing Symposium (2009), IEEE, pp. 1–12.
[73] Shue, D., Freedman, M. J., and Shaikh, A. Performance Isolation and Fairness for Multi-Tenant Cloud Storage. In Proceedings of the 10th Symposium on Operating Systems Design and Implementation (2012), vol. 12, pp. 349–362.
[74] Taft, R., Mansour, E., Serafini, M., Duggan, J., Elmore, A. J., Aboulnaga, A., Pavlo, A., and Stonebraker, M. E-Store: Fine-Grained Elastic Partitioning for Distributed Transaction Processing Systems. In Proceedings of the VLDB Endowment (Nov. 2014), vol. 8, VLDB Endowment, pp. 245–256.
[75] Tatbul, N., Ahmad, Y., Çetintemel, U., Hwang, J.-H., Xing, Y., and Zdonik, S. Load Management and High Availability in the Borealis Distributed Stream Processing Engine. GeoSensor Networks (2006), 66–85.
[76] Tatbul, N., Çetintemel, U., and Zdonik, S. Staying Fit: Efficient Load Shedding Techniques for Distributed Stream Processing. In Proceedings of the 33rd International Conference on Very Large Data Bases (2007), VLDB Endowment, pp. 159–170.
[77] Yahoo! Storm Team. Benchmarking Streaming Computation Engines at Yahoo! https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at, 2015. Last Visited: Thursday 6th September, 2018.
[78] Terry, D. B., Prabhakaran, V., Kotla, R., Balakrishnan, M., Aguilera, M. K., and Abu-Libdeh, H. Consistency-Based Service Level Agreements for Cloud Storage. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (2013), ACM, pp. 309–324.
[79] Toshniwal, A., Taneja, S., Shukla, A., Ramasamy, K., Patel, J. M., Kulkarni, S., Jackson, J., Gade, K., Fu, M., Donham, J., et al. Storm @ Twitter. In Proceedings of the International Conference on Management of Data (SIGMOD) (2014), ACM, pp. 147–156.
[80] Vavilapalli, V. K., Murthy, A. C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., et al. Apache Hadoop YARN: Yet Another Resource Negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing (2013), ACM, p. 5.
[81] Wang, A., Venkataraman, S., Alspaugh, S., Katz, R., and Stoica, I. Cake: Enabling High-Level SLOs on Shared Storage Systems. In Proceedings of the 3rd ACM Symposium on Cloud Computing (2012), ACM, p. 14.
[82] White, B., Lepreau, J., Stoller, L., Ricci, R., Guruprasad, S., Newbold, M., Hibler, M., Barb, C., and Joglekar, A. An Integrated Experimental Environment for Distributed Systems and Networks. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (Boston, MA, Dec. 2002), USENIX Association, pp. 255–270.
[83] Wikipedia. Pareto Efficiency — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Pareto_efficiency&oldid=741104719, 2016. Last Visited: Thursday 6th September, 2018.
[84] Wu, K.-L., Hildrum, K. W., Fan, W., Yu, P. S., Aggarwal, C. C., George, D. A., Gedik, B., Bouillet, E., Gu, X., Luo, G., et al. Challenges and Experience in Prototyping a Multi-Modal Stream Analytic and Monitoring Application on System S. In Proceedings of the 33rd International Conference on Very Large Data Bases (2007), VLDB Endowment, pp. 1185–1196.
[85] Wu, Y., and Tan, K.-L. ChronoStream: Elastic Stateful Stream Computation in the Cloud. In Proceedings of the 31st International Conference on Data Engineering (2015), IEEE, pp. 723–734.
[86] Wu, Z., Butkiewicz, M., Perkins, D., Katz-Bassett, E., and Madhyastha, H. V. SPANStore: Cost-Effective Geo-Replicated Storage Spanning Multiple Cloud Services. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (2013), ACM, pp. 292–308.
[87] Xu, L., Peng, B., and Gupta, I. Stela: Enabling Stream Processing Systems to Scale-in and Scale-out On-demand. In IEEE International Conference on Cloud Engineering (IC2E) (2016), IEEE, pp. 22–31.
[88] Zaharia, M., Das, T., Li, H., Hunter, T., Shenker, S., and Stoica, I. Discretized Streams: Fault-Tolerant Streaming Computation at Scale. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (New York, NY, USA, 2013), SOSP '13, ACM, pp. 423–438.
[89] Zhang, H., Ananthanarayanan, G., Bodik, P., Philipose, M., Bahl, P., and Freedman, M. J. Live Video Analytics at Scale with Approximation and Delay-Tolerance. In NSDI (2017), vol. 9, p. 1.
[90] Zhang, H., Stafman, L., Or, A., and Freedman, M. J. SLAQ: Quality-Driven Scheduling for Distributed Machine Learning. In Proceedings of the 2017 Symposium on Cloud Computing (2017), ACM, pp. 390–404.
[91] Zhao, H. C., Xia, C. H., Liu, Z., and Towsley, D. A Unified Modeling Framework for Distributed Resource Allocation of General Fork and Join Processing Networks. In ACM SIGMETRICS Performance Evaluation Review (2010), vol. 38, ACM, pp. 299–310.