
GRASS: Trimming Stragglers in Approximation Analytics

Ganesh Ananthanarayanan¹, Michael Chien-Chun Hung², Xiaoqi Ren³, Ion Stoica¹, Adam Wierman³, Minlan Yu²

¹University of California, Berkeley, ²University of Southern California, ³California Institute of Technology

Abstract

In big data analytics timely results, even if based on only part of the data, are often good enough. For this reason, approximation jobs, which have deadline or error bounds and require only a subset of their tasks to complete, are projected to dominate big data workloads. Straggler tasks are an important hurdle when designing approximate data analytic frameworks, and the widely adopted approach to deal with them is speculative execution. In this paper, we present GRASS, which carefully uses speculation to mitigate the impact of stragglers in approximation jobs. The design of GRASS is based on first-principles analysis of the impact of speculative copies. GRASS delicately balances immediacy of improving the approximation goal with the long term implications of using extra resources for speculation. Evaluations with production workloads from Facebook and Microsoft Bing in an EC2 cluster of 200 nodes show that GRASS increases accuracy of deadline-bound jobs by 47% and speeds up error-bound jobs by 38%. GRASS's design also speeds up exact computations, making it a unified solution for straggler mitigation.

1 Introduction

Large scale data analytics frameworks automatically compose jobs operating on large data sets into many small tasks and execute them in parallel on compute slots on different machines. A key feature catalyzing the widespread adoption of these frameworks is their ability to guard against failures of tasks, both when tasks fail outright as well as when they run slower than the other tasks of the job. Dealing with the latter, referred to as stragglers, is a crucial design component that has received widespread attention across prior studies [1, 2, 3].

The dominant technique to mitigate stragglers is speculation—launching speculative copies for the slower tasks, where a speculative copy is simply a duplicate of the original task. It then becomes a race between the original and the speculative copies. Such techniques are state-of-the-art and deployed in production clusters at Facebook and Microsoft Bing, thereby significantly speeding up jobs. The focus of this paper is on speculation for an emerging class of jobs: approximation jobs.

Approximation jobs are starting to see considerable interest in data analytics clusters [4, 5, 6]. These jobs are based on the premise that providing a timely result, even if only on part of the dataset, is more important than processing the entire data. These jobs tend to have approximation bounds on two dimensions—deadline and error [7]. Deadline-bound jobs strive to maximize the accuracy of their result within a specified time deadline. Error-bound jobs, on the other hand, strive to minimize the time taken to reach a specified error limit in the result. Typically, approximation jobs are launched on a large dataset and require only a subset of their tasks to finish based on the bound [8, 9, 10].

Our focus is on the problem of speculation for approximation jobs.¹ Traditional speculation techniques for straggler mitigation face a fundamental limitation when dealing with approximation jobs, since they do not take into account approximation bounds. Ideally, when the job has many more tasks than compute slots, we want to prioritize those tasks that are likely to complete within the deadline or those that contribute the earliest to meeting the error bound. By not considering the approximation bounds, state-of-the-art straggler mitigation techniques in production clusters at Facebook and Bing fall significantly short of optimal mitigation. They are 48% lower in average accuracy for deadline-bound jobs and 40% higher in average duration of error-bound jobs.

Optimally prioritizing tasks of a job to slots is a classic scheduling problem with known heuristics [11, 12, 13]. These heuristics, unfortunately, do not directly carry over to our scenario for the following reasons. First, they calculate the optimal ordering statically. Straggling of tasks, on the other hand, is unpredictable and necessitates dynamic modification of the priority ordering of tasks according to the approximation bounds. Second, and most importantly, traditional prioritization techniques assign tasks to slots assuming that every task occupies only one slot. Spawning a speculative copy, however, leads to the same task using two (or multiple) slots simultaneously. This distills our challenge to achieving the approximation bounds by dynamically weighing the gains due to speculation against the cost of using extra resources for speculation.

Scheduling a speculative copy helps make immediate progress by finishing a task faster. However, while not scheduling a speculative copy leaves the task running slower, many more tasks may be completed using the saved slot. To understand this opportunity cost, consider a cluster with one unoccupied slot and a straggler task. Letting the straggler complete takes five more time units while a new copy of it would take four time units. Scheduling a speculative copy for this straggler speeds it up by one time unit; however, if we were not to, that slot could finish another task (taking five time units too).

¹Note that an error-bound job with error of zero is the same as an exact job that requires all its tasks to complete. Hence, by focusing on approximation jobs, we automatically subsume exact computations.

This simple intuition of opportunity cost forms the basis for our two design proposals. First, Greedy Speculative (GS) scheduling is an algorithm that greedily picks the task to schedule next (original or speculative) that most improves the approximation goal at that point. Second, Resource Aware Speculative (RAS) scheduling considers the opportunity cost and schedules a speculative copy only if doing so saves both time and resources.

These two designs are motivated by first principles analysis within the context of a theoretical model for studying speculative scheduling. An important guideline from our model is that the value of being greedy (GS) increases for smaller jobs while considering the opportunity cost of speculation (RAS) helps for larger jobs. As our model is generic, a nice aspect is that the guideline holds not only for approximation jobs but also for exact jobs that require all their tasks to complete.

We use the above guideline to dynamically combine GS and RAS, which we call GRASS. At the beginning of a job's execution, GRASS uses RAS for scheduling tasks. Then, as the job gets close to its approximation bound, it switches to GS, since our theoretical model suggests that the opportunity cost of speculation diminishes with fewer unscheduled tasks in the job. GRASS learns the point to switch from RAS to GS using job and cluster characteristics.

We demonstrate the generality of GRASS by implementing it in both Hadoop [14] (for batch jobs) and Spark [15] (for interactive jobs). We evaluate GRASS using production workloads from Facebook and Bing on an EC2 cluster with 200 machines. GRASS increases accuracy of deadline-bound jobs by 47% and speeds up error-bound jobs by 38% compared to state-of-the-art straggler mitigation techniques deployed in these clusters (LATE [2] and Mantri [1]). In fact, GRASS results in near-optimal performance. In addition, GRASS also speeds up exact jobs, which require all their tasks to complete, by 34%. Thus, it is a unified speculation solution for both approximation as well as exact computations.

2 Challenges and Opportunities

Before presenting our system design, it is important to understand the challenges and opportunities for speculating straggler tasks in the context of approximation jobs.

2.1 Approximation Jobs

Increasingly, with the deluge of data, analytics applications no longer require processing entire datasets. Instead, they choose to trade off accuracy for response time. Approximate results obtained early from just part of the dataset are often good enough [4, 6, 5]. Approximation is explored across two dimensions—time for obtaining the result (deadline) and error in the result [7].

• Deadline-bound jobs strive to maximize the accuracy of their result within a specified time limit. Such jobs are common in real-time advertisement systems and web search engines. Generally, the job is spawned on a large dataset and accuracy is proportional to the fraction of data processed [8, 9, 10] (or tasks completed, for ease of exposition).

• Error-bound jobs strive to minimize the time taken to reach a specified error limit in the result. Again, accuracy is measured in the amount of data processed (or tasks completed). Error-bound jobs are used in scenarios where the value in reducing the error below a limit is marginal, e.g., counting the number of cars crossing a section of a road to the nearest thousand is sufficient for many purposes.

Approximation jobs require schedulers to prioritize the appropriate subset of their tasks depending on the deadline or error bound. Prioritization is important for two reasons. First, due to cluster heterogeneities [2, 3, 16], tasks take different durations even if assigned the same amount of work. Second, jobs are often multi-waved, i.e., their number of tasks is much more than the available compute slots [17], so they run only a fraction of their tasks at a time. The trend of multi-waved jobs is only expected to grow with smaller tasks [18].

2.2 Challenges

The main challenge in prioritizing tasks of approximation jobs arises due to some of them straggling. Even after applying many proactive techniques, in production clusters in Facebook and Microsoft Bing, the average job's slowest task is eight times slower than the median.² Further, it is difficult to model all the complex interactions in clusters to prevent stragglers [3, 19].

The widely adopted technique to deal with straggler tasks is speculation. This is a reactive technique that spawns speculative copies for tasks deemed to be straggling. The earliest among the original and speculative copies is picked while the rest are killed. While scheduling a speculative copy makes the task finish faster and thereby increases accuracy, speculative copies also compete for compute slots with the unscheduled tasks.

²Task durations are normalized by their input sizes.


Therefore, our problem is to dynamically prioritize tasks based on the deadline/error-bound while choosing between speculative copies for stragglers and unscheduled tasks. This problem is, unfortunately, NP-hard, and devising good heuristics (i.e., with good approximation factors) is an open theoretical problem.

2.3 Potential Gains

Given the challenges posed by stragglers discussed above, it is not surprising that the potential gains from mitigating their impact are significant. To highlight this we use a simulator with an optimal bin-packing scheduler. Our baselines are the state-of-the-art mitigation strategies (LATE [2] and Mantri [1]) in the production clusters. Optimally prioritizing the tasks while correctly balancing between speculative copies and unscheduled tasks presents the following potential gains. Deadline-bound jobs improve their accuracy by 48% and 44%, in the Facebook and Bing traces, respectively. Error-bound jobs speed up by 32% and 40%. We next develop an online heuristic to achieve these gains.

3 Speculation Algorithm Design

The key choice made by a cluster scheduling algorithm is to pick the next task to schedule given a vacant slot. Traditionally, this choice is made among the set of tasks that are queued; however, when speculation is allowed, the choice also includes speculative copies of tasks that are already running. This extra flexibility means that a design must determine a prioritization that carefully weighs the gains from speculation against the cost of extra resources while still meeting the approximation goals. Thus, we first focus on tradeoffs in the design of the speculation policy. Specifically, using both small examples and analytic modeling we motivate the use of two simple heuristics, Greedy Speculative (GS) scheduling and Resource Aware Speculative (RAS) scheduling, that together make up the core of GRASS.

3.1 Speculation Alternatives

For simplicity, we first introduce GS and RAS in the context of deadline-bound jobs and then briefly describe how they can be adapted to error-bound jobs.

3.1.1 Deadline-bound Jobs

If speculation were not allowed, there is a natural, well-understood policy for the case of deadline-bound jobs: Shortest Job First (SJF), which schedules the task with the smallest processing time. In many settings, SJF can be proven to minimize the number of incomplete tasks in the system, and thus maximize the number of tasks completed, at all points of time among the class of non-preemptive policies [11, 12]. Thus, without speculation, SJF finishes the most tasks before the deadline.

Figure 1: GS and RAS for a deadline-bound job with 9 tasks. The trem and tnew values are when T2 finishes. The example illustrates deadline values of 3 and 6 time units.

If one extends this idea to the case where speculation is allowed, then a natural approach is to allow the tasks that are currently running to also be placed in the queue, and to choose the task with the smallest size, i.e., tnew (requiring, of course, that the task can finish before the deadline). Then, if the chosen task has a copy currently running, we check that the new speculative copy being considered provides a benefit, i.e., tnew < trem. So, the next task to run is still chosen according to SJF, only now speculative copies are also considered. We term this policy Greedy Speculative (GS) scheduling, because it picks the next task to schedule greedily, i.e., the one that will finish the quickest, and thus improve the accuracy the earliest at present.

Figure 1 (left) presents an illustration of GS for a simple job with nine tasks and two concurrent slots. Tasks T1 and T2 are scheduled first, and when T2 finishes, the trem and tnew values are as indicated. At this point, GS schedules T3 next as it is the one with the lowest tnew, and so forth. Assuming the deadline was set to 6 time units, the obtained accuracy is 7/9 (or 7 completed tasks).

Picking the earliest task to schedule next is often optimal when a job has no unscheduled tasks (i.e., either single-waved jobs or the last wave of a multi-waved job). However, when there are unscheduled tasks it is less clear. For example, in Figure 1 (right), if we schedule a speculative copy of T1 when T2 finished, instead of T3, 8 tasks finish by the deadline of 6 time units.

The previous example highlights that running a speculative copy has resource implications which are important to consider. If the speculative copy finishes early, both slots (of the speculative copy and the original) become available sooner to start the other tasks. This opportunity cost of speculation is an important tradeoff to consider, and leads to the second policy we consider: Resource Aware Speculative (RAS) scheduling.


1: procedure DEADLINE(〈Task〉 T, float δ, bool OC)    ▷ OC = 1 → use RAS; 0 → use GS
2:   if OC then
3:     for each Task t in T do
4:       if t.running then
           t.saving = t.c × t.trem − (t.c + 1) × t.tnew
     ▷ PRUNING STAGE
     δ′ ← Remaining Time to δ
     〈Task〉 Γ ← φ
5:   for each Task t in T do
6:     if t.tnew > δ′ then continue    ▷ Exceeds deadline
7:     if OC then
8:       if t.saving > 0 then Γ.add(t)
9:     else
10:      if t.running then
11:        if t.tnew < t.trem then Γ.add(t)
12:      else Γ.add(t)
     ▷ SELECTION STAGE
13:  if Γ ≠ null then
14:    if OC then SortDescending(Γ, "saving")
15:    else SortAscending(Γ, "tnew")
     return Γ.first()

Pseudocode 1: GS and RAS algorithms for deadline-bound jobs (deadline of δ). T is the set of unfinished tasks with the following fields per task: trem, tnew, and a boolean "running" to denote if a copy of it is currently executing. RAS is used when OC is set. By default, both algorithms schedule the task with the lowest tnew within the deadline.

To account for the opportunity cost of scheduling a speculative copy, RAS speculates only if it saves both time and resources. Thus, not only must tnew be less than trem to spawn a speculative copy, but the sum of the resources used by the speculative and original copies, when running simultaneously, must be less than letting just the original copy finish. In other words, for a task with c running copies, its resource savings, defined as c × trem − (c + 1) × tnew, must be positive.
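To make the savings criterion concrete, the short Python sketch below (our own rendering, not part of GRASS; the function and variable names are ours) evaluates RAS's test on the straggler from the introduction, whose running copy needs 5 more time units while a fresh copy would take 4:

import math  # not strictly needed; kept for easy extension

def resource_savings(copies, t_rem, t_new):
    # RAS's savings criterion: letting `copies` copies run for t_rem more time
    # units costs copies * t_rem slot-time; adding one speculative copy costs
    # (copies + 1) * t_new if the new copy wins the race.
    return copies * t_rem - (copies + 1) * t_new

def ras_would_speculate(copies, t_rem, t_new):
    # RAS speculates only if it saves both time and resources.
    return t_new < t_rem and resource_savings(copies, t_rem, t_new) > 0

# Straggler from Section 1: 5 time units remaining, a new copy takes 4.
print(resource_savings(1, 5, 4))      # -3: speculation wastes 3 slot-time units
print(ras_would_speculate(1, 5, 4))   # False: RAS lets the original finish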

By accounting for the opportunity cost of resources, RAS can out-perform GS in many cases. As mentioned earlier, in Figure 1 (right), RAS achieves an accuracy of 8/9 versus GS's 7/9 within the deadline of 6 time units. This improvement comes because, when T2 finishes, speculating on T1 saves 1 unit of resource.

However, RAS is not uniformly better than GS. In particular, RAS's cautious approach can backfire if it overestimates the opportunity cost. In the same example in Figure 1, if the deadline of the job were reduced from 6 time units to 3 time units instead, GS performs better than RAS. At the end of 3 time units, GS has led to three completed tasks while RAS has little to show for its resource gains by speculating on T1.

As the example alludes to, the value of the deadline and the number of waves are two important factors that impact whether GS or RAS is a better choice. A third important factor, which we discuss later in §4.1, is the estimation accuracy of trem and tnew.

1: procedure ERROR(〈Task〉 T, float ε, bool OC)    ▷ OC = 1 → use RAS; 0 → use GS
                                                   ▷ Error ε is in #tasks
2:   for each Task t in T do
       t.duration = min(t.trem, t.tnew)
3:     if OC then
4:       if t.running then
           t.saving = t.c × t.trem − (t.c + 1) × t.tnew
     ▷ PRUNING STAGE
     SortAscending(T, "duration")
     〈Task〉 Γ ← φ
5:   for each Task t in T[0 : T.count() × (1 − ε)] do    ▷ Earliest tasks
6:     if OC then
7:       if t.saving > 0 then Γ.add(t)
8:     else
9:       if t.running then
10:        if t.tnew < t.trem then Γ.add(t)
11:      else Γ.add(t)
     ▷ SELECTION STAGE
12:  if Γ ≠ null then
13:    if OC then SortDescending(Γ, "saving")
14:    else SortDescending(Γ, "trem")
     return Γ.first()

Pseudocode 2: GS and RAS speculation algorithms for error-bound jobs (error-bound of ε). T is the set of unfinished tasks with the following fields per task: trem, tnew, and a boolean "running" to denote if a copy of it is currently executing. The trem of a task is the minimum over all its running copies. RAS is used when OC is set. By default, both algorithms schedule the task with the highest trem.

Pseudocode 1 describes the details of GS and RAS. The set T consists of all the running and unscheduled tasks of the job. There are two stages in the scheduling process: (i) Pruning Stage: In this stage (lines 5−12), tasks that are not slated to complete by the deadline are removed from consideration. Further, GS removes those tasks whose speculative copy is not expected to finish earlier than the running copy. RAS removes those tasks which do not save on resources by speculation. (ii) Selection Stage: From the pruned set, GS picks the task with the lowest tnew while RAS picks the task with the highest resource savings (lines 13−15).
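For concreteness, the pruning-and-selection structure can be rendered in a few lines of Python. This is our own sketch, not the authors' implementation; in particular, how an unscheduled task is ranked against speculative copies under RAS is a simplifying assumption on our part (treated as zero resource savings).

from dataclasses import dataclass

@dataclass
class Task:
    t_rem: float      # estimated remaining time of the running copy (if any)
    t_new: float      # estimated duration of a fresh copy
    running: bool     # whether a copy of this task is currently executing
    copies: int = 1   # number of copies currently running

def pick_next_deadline(tasks, time_to_deadline, use_ras):
    # Pruning stage (cf. Pseudocode 1, lines 5-12): drop tasks whose fresh copy
    # would miss the deadline; for running tasks, keep only those whose
    # speculation is worthwhile under the chosen policy.
    candidates = []
    for t in tasks:
        if t.t_new > time_to_deadline:
            continue
        if t.running:
            saving = t.copies * t.t_rem - (t.copies + 1) * t.t_new
            ok = saving > 0 if use_ras else t.t_new < t.t_rem
            if ok:
                candidates.append((t, saving))
        else:
            candidates.append((t, 0.0))  # unscheduled task (simplified ranking)
    if not candidates:
        return None
    # Selection stage (lines 13-15): RAS picks the largest resource savings,
    # GS picks the smallest t_new.
    if use_ras:
        return max(candidates, key=lambda c: c[1])[0]
    return min(candidates, key=lambda c: c[0].t_new)[0]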

3.1.2 Error-bound Jobs

Though error-bound jobs require a different form of prioritization than deadline-bound jobs, the speculative core of the GS and RAS algorithms is again quite natural. Specifically, the goal of error-bound jobs is to minimize the makespan of the tasks needed to achieve the error limit. Thus, instead of SJF, Longest Job First (LJF) is the natural prioritization of tasks. In particular, LJF minimizes the makespan among the class of non-preemptive policies in many settings [11, 12]. This again represents a "greedy" prioritization for this setting.

Figure 2: GS and RAS for an error-bound job with 6 tasks. The trem and tnew values are when T2 finishes. The example illustrates error limits of 40% (3 tasks) and 20% (4 tasks).

Despite the above change to the prioritization of which task to schedule, the form of GS and RAS remains the same as in the case of deadline-bound jobs. In particular, speculative copies are evaluated in the same manner, e.g., RAS's criterion is still to pick the task whose speculation leads to the highest resource savings. Pseudocode 2 presents the details. The pruning stage (lines 5−11) will remove from consideration those tasks that are not the earliest to contribute to the desired error bound. The list of earliest tasks is based on the effective duration of every task, i.e., the minimum of trem and tnew. During selection (lines 12−14), GS picks the task with the highest trem while RAS picks the task with the highest saving.
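A matching Python sketch for the error-bound case follows (ours, reusing the Task fields from the earlier sketch; as before, the ranking of unscheduled tasks is a simplification):

def pick_next_error_bound(tasks, error_fraction, use_ras):
    # A task's effective duration is when its quickest copy would finish.
    def effective(t):
        return min(t.t_rem, t.t_new) if t.running else t.t_new
    # Pruning (cf. Pseudocode 2, lines 5-11): only the earliest (1 - error)
    # fraction of tasks can contribute to meeting the error bound.
    needed = int(len(tasks) * (1 - error_fraction))
    earliest = sorted(tasks, key=effective)[:needed]
    candidates = []
    for t in earliest:
        if t.running:
            saving = t.copies * t.t_rem - (t.copies + 1) * t.t_new
            if (use_ras and saving > 0) or (not use_ras and t.t_new < t.t_rem):
                candidates.append((t, saving))
        else:
            candidates.append((t, 0.0))  # unscheduled task (simplified ranking)
    if not candidates:
        return None
    # Selection (lines 12-14): RAS picks the highest resource savings; GS picks
    # the highest t_rem, i.e., it speculates on the longest-running straggler.
    if use_ras:
        return max(candidates, key=lambda c: c[1])[0]
    return max(candidates, key=lambda c: c[0].t_rem if c[0].running else c[0].t_new)[0]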

Figure 2 presents an illustration of GS and RAS for an error-bound job with 6 tasks and 3 compute slots. The trem and tnew values are at 5 time units. GS decides to launch a copy of T3 as it has the highest trem. RAS conservatively avoids doing so. Consequently, when the error limit is high (say, 40%) GS is quicker, but RAS is better when the limit decreases (to, say, 20%).

3.2 Contrasting GS and RAS

To this point, we have seen that GS and RAS are two natural approaches for integrating speculation into a cluster scheduler for approximation jobs. However, the examples we have considered highlight that neither GS nor RAS is uniformly better. In order to develop a better understanding of these two algorithms, as well as other possible alternatives, we have developed a simple analytic model for speculation in approximation jobs. The model assumes wave-based scheduling and constant wave-width for a job (see §A for details along with formal results). For readability, here we present only the three major guidelines from our analysis.

Guideline 1: During the early waves of a job, speculation is only valuable if task durations are extremely heavy tailed, e.g., Pareto with infinite variance (i.e., with shape parameter β < 2). In this case, it is optimal to speculate conservatively, using ≤ 2 copies of a task.

Figure 3: Hill plot of Facebook task durations (Hill estimate of β vs. number of order statistics; the flat region sits near β = 1.259).

Figure 4: Near-optimality of GS & RAS under Pareto task durations (β = 1.259); processing time normalized to optimal vs. ω, for jobs of 1 to 5 waves, with GS and RAS marked by vertical lines.

Task durations are indeed heavy-tailed for the Facebook and Bing traces, as illustrated by the Hill plot³ in Figure 3. Task durations have a Pareto tail with shape parameter β = 1.259. While both GS and RAS speculate during early waves, RAS is more conservative than GS and thus outperforms it in those waves.
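A Hill plot of the kind in Figure 3 can be reproduced from a trace of normalized task durations with the standard Hill estimator; the sketch below is ours (assuming strictly positive durations) and is not the paper's analysis tooling:

import math

def hill_estimates(durations):
    # Hill estimator of the Pareto tail index beta, computed for each number k
    # of upper order statistics; plotting the estimate against k gives a Hill plot.
    xs = sorted(durations, reverse=True)   # xs[0] >= xs[1] >= ... >= xs[n-1]
    estimates = []
    for k in range(1, len(xs)):
        mean_log_excess = sum(math.log(xs[i] / xs[k]) for i in range(k)) / k
        if mean_log_excess > 0:
            estimates.append((k, 1.0 / mean_log_excess))
    return estimates

# A flat region of the resulting curve indicates a well-approximated Pareto
# tail; for the Facebook trace the flat region sits near beta ~ 1.26.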

Guideline 2: During the final wave of a job, speculate aggressively to fully utilize the allotted capacity.

Even if all tasks are currently scheduled, if a slot becomes available it should be filled with a speculative copy. Note that both GS and RAS do this to some extent, but since GS speculates more aggressively than RAS it outperforms RAS during the final wave.

Guideline 3: For jobs that require more than two waves RAS is near-optimal, while for jobs that require fewer than two waves GS is near-optimal.

To make this point more salient, consider an arbitrary speculative policy that waits until a task has run ω time before starting a speculative copy (see §A). GS and RAS correspond to particular rules for choosing ω. To translate them into the model, we define tnew = E[τ] and trem = E[τ − ω | τ > ω], where τ is a random task size. Then, under GS, ω is the time when E[τ] = E[τ − ω | τ > ω], and, under RAS, ω is the time when 2E[τ] = E[τ − ω | τ > ω].
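Under a Pareto task-size distribution these two thresholds have a simple closed form, since the mean residual life E[τ − ω | τ > ω] of a Pareto(x_m, β) grows linearly as ω/(β − 1) for ω ≥ x_m. The sketch below is our illustration of the model (not part of GRASS itself) evaluated at the measured tail index β = 1.259:

def pareto_mean(beta, x_m=1.0):
    # E[tau] for a Pareto(x_m, beta) task-size distribution with beta > 1.
    return beta * x_m / (beta - 1)

def mean_residual_life(omega, beta):
    # E[tau - omega | tau > omega] = omega / (beta - 1) for omega >= x_m.
    return omega / (beta - 1)

def switch_point(multiplier, beta, x_m=1.0):
    # Smallest omega with multiplier * E[tau] = E[tau - omega | tau > omega];
    # multiplier = 1 corresponds to GS, multiplier = 2 to RAS.
    return multiplier * beta * x_m

beta = 1.259
omega_gs, omega_ras = switch_point(1, beta), switch_point(2, beta)
assert abs(mean_residual_life(omega_gs, beta) - pareto_mean(beta)) < 1e-9
print(omega_gs, omega_ras)   # ~1.26 and ~2.52 times the minimum task size:
                             # GS starts speculating on a straggler much
                             # earlier than RAS does.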

Figure 4 shows the ratio of the response time normalized to the optimal duration for jobs of differing numbers of waves, with parameter ω ∈ [0, 5]. GS and RAS are shown via vertical lines. The figure shows that neither GS nor RAS is universally optimal, but each is near-optimal over a range of job types.

³A Hill plot provides a more robust estimation of Pareto distributions than regression on a log-log plot [20]. The fact that the curve is flat over a large range of scales (on the x-axis), but not all scales, indicates that the whole distribution is likely not Pareto, but that the tail of the distribution is well-approximated by a Pareto tail.


This motivates a system design that starts using RAS for early waves and then switches to GS for the final two waves. However, in practice, identifying the "final two waves" is difficult since this requires predicting how many tasks will complete either before the deadline or before the error limit is reached. Hence, we interpret this guideline as: when the deadline is loose or the error limit is low, RAS is better, while otherwise GS performs better, mimicking the intuition from the examples in §3.1.

4 GRASS Speculation Algorithm

In this section, we build our speculation algorithm, called GRASS.⁴ Our theoretical analysis summarized in §3.2 highlights that it is desirable to use RAS during the early waves of jobs and GS during the final two waves. A simple strawman solution to achieve this would be as follows. For deadline-bound jobs, switch from RAS to GS when the time to the deadline is sufficient for at most two waves of tasks. Similarly, for error-bound jobs, switch when the number of (unique) scheduled tasks needed to satisfy the error-bound makes up two waves.

Identifying the final two waves of tasks is difficult in practice. Tasks are not scheduled at explicit wave boundaries but rather as and when slots open up. In addition, the wave-width of jobs does not stay constant but varies considerably depending on cluster utilization. Finally, task durations are varied and hard to estimate.

The complexities in these systems mean that precise estimates of the optimal switching point cannot be obtained from our model. Instead, we adopt an indirect learning based approach where inferences are made based on executions of previous jobs (with a similar number of tasks) and cluster characteristics (utilization and estimation accuracy). We compare our learning approach to the strawman in §6.3, and show that the improvement is dramatic.

4.1 Learning the Switching Point

An ideal approach would accumulate enough samples of job performance (accuracy or completion time) based on switching to GS at different points. For deadline-bound jobs, this is decided by the remaining time to the deadline. For error-bound jobs, this is decided by the number of tasks to complete towards meeting the error. To speed up our sample collection, instead of accumulating samples of switching to GS, we simply get samples of job performance by using GS or RAS throughout the job.

An incoming job starts with RAS and periodically compares samples of jobs smaller than its size during its execution to check if it is better to switch to GS. It checks by using the remaining work at any point (measured in time remaining or tasks to complete) to calculate the effect of switching to GS. It steps through all possible points in its remaining work at which it could switch and estimates the optimal point using job samples of appropriate sizes. It continues with RAS until the optimal switching point turns out to be at present. The above calculation for the optimal switching point is performed periodically during the job's execution.

⁴GRASS comes from the concatenation of GS and RAS.

The optimal switching point changes with time because the size of the job alone is insufficient for the calculation. Even jobs of comparable size might have a different number of waves depending on the number of available slots. Therefore, we augment our samples of job performance with the number of waves of execution, simply approximated using current cluster utilization.

Finally, the estimation accuracy of trem and tnew also decides the optimal switching point. RAS's cautious approach of considering the opportunity cost of speculating a task is valuable when task estimates are erroneous. In fact, at low estimation accuracies (along with certain values of utilization and deadline/error-bound), it is better to not switch to GS at all and employ RAS all along.

Therefore, GRASS obtains samples of job performance with both GS and RAS across values of deadline/error-bound, estimation accuracy of trem and tnew, and cluster utilization. It uses these three factors collectively to decide when (and if) to switch from RAS to GS. We next describe how the samples are collected.
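One way to realize this sample-based decision is a nearest-neighbor lookup over past jobs keyed by the three factors. The sketch below is our illustrative approximation of the idea, not GRASS's actual learner; the record fields, the distance metric and the neighborhood size are our assumptions:

# Each sample records the conditions a past job ran under, the single policy
# it used throughout (a consequence of the xi-perturbation in Sec. 4.2), and
# the performance it achieved (accuracy for deadline-bound jobs, completion
# time for error-bound jobs).
def should_switch_to_gs(samples, remaining_work, utilization, est_accuracy,
                        higher_is_better=True, k=20):
    def distance(s):
        return (abs(s["remaining_work"] - remaining_work)
                + abs(s["utilization"] - utilization)
                + abs(s["est_accuracy"] - est_accuracy))
    neighbors = sorted(samples, key=distance)[:k]
    gs = [s["performance"] for s in neighbors if s["policy"] == "GS"]
    ras = [s["performance"] for s in neighbors if s["policy"] == "RAS"]
    if not gs or not ras:
        return False   # not enough comparable samples: stay with RAS
    gs_avg, ras_avg = sum(gs) / len(gs), sum(ras) / len(ras)
    return gs_avg > ras_avg if higher_is_better else gs_avg < ras_avg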

4.2 Generating Samples

Generating samples of job performance in online schedulers presents a dichotomy. On the one hand, GRASS picks the appropriate point to switch to GS based on the samples thus far. However, on the other hand, it has to continuously update its samples to stay abreast with dynamic changes in clusters. Updating samples, in turn, requires it to pick GS or RAS for the entire duration of the job. To cope with this exploration–exploitation tradeoff, we introduce a perturbation in GRASS's decision. With a small probability ξ, we pick GS or RAS for the entire duration of the job; GS and RAS are equally probable. Such perturbation helps us obtain comparable samples.

The crucial trade-off in setting ξ is in balancing the benefit of obtaining such comparable samples with the performance loss incurred by the job due to not making the right switching decision. Theoretical analyses of such situations in prior work define an optimal value of ξ by making stochastic assumptions about the distribution of the costs and the associated rewards [21, 22]. Our setup, however, does not lend itself to such assumptions as the underlying distribution can be arbitrary.

Therefore, we pick a constant value of ξ using empirical analysis. A job is marked for generating performance samples with a probability of ξ, and we pick GS or RAS with equal probability. Further, in practice, we bucket jobs by their number of tasks and compare only within jobs of the same bucket.
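The ξ-perturbation is essentially an ε-greedy exploration rule; a minimal sketch (ours) of how a job's policy could be chosen at submission time:

import random

XI = 0.15   # perturbation probability; 15% was found empirically best (Sec. 6.3.3)

def choose_policy_for_job():
    # With probability XI, dedicate the job to one policy (GS or RAS, equally
    # likely) for its entire run, to generate a comparable performance sample;
    # otherwise run GRASS's adaptive RAS-then-GS switching.
    if random.random() < XI:
        return random.choice(["GS-only", "RAS-only"])
    return "GRASS-adaptive"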

5 Implementation

We implement GRASS on top of two data-analytics frameworks, Hadoop (version 0.20.2) [14] and Spark (version 0.7.3) [15], representing batch jobs and interactive jobs, respectively. Hadoop jobs read data from HDFS while Spark jobs read from in-memory RDDs. Consequently, Spark tasks finished quicker than Hadoop tasks, even with the same input size. Note that while Hadoop and Spark use LATE [2] currently, we also implement Mantri [1] to use as a second baseline.

Implementing GRASS required changes to two components: the task executors and the job scheduler. Task executors were augmented to periodically report progress. We piggyback on existing update mechanisms of tasks that conveyed only their start and finish. Progress reports were configured to be sent every 5% of data read/written. The job scheduler collects these reports, maintains samples of completed tasks and jobs, and decides the switching point.

5.1 Task Estimators

GRASS uses two estimates for tasks: the remaining duration of a running task (trem) and the duration of a new copy (tnew).

Estimating trem: Tasks periodically update the scheduler with reports of their progress. A progress report contains the fraction of input data read and the output data written. Since tasks of analytics jobs are IO-intensive, we extrapolate the remaining duration of the task based on the time elapsed thus far.

Estimating tnew: We log durations of all completed tasks of a job and estimate the duration of a new task by sampling from the log. We normalize the durations to the input and output sizes. The tnew values of all unfinished tasks are updated whenever a task completes.

Accuracy of estimation: While the above techniques are simple, the downside is the error in estimation. Our estimates of trem and tnew achieve moderate accuracies of 72% and 76%, respectively, on average. When a task completes, we update the accuracy using the estimated and actual durations. GRASS uses the accuracy of estimation to appropriately switch from RAS to GS.
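Both estimators are simple extrapolations; the sketch below (ours, with invented function and parameter names) shows how they could be computed from progress reports and the completed-task log:

import random

def estimate_t_rem(elapsed, fraction_done):
    # Tasks of analytics jobs are IO-intensive, so progress (fraction of input
    # read / output written) is assumed to accrue roughly linearly with time;
    # extrapolate the remaining duration from the time elapsed so far.
    if fraction_done <= 0:
        return float("inf")   # no progress report yet
    return elapsed * (1 - fraction_done) / fraction_done

def estimate_t_new(completed_durations_per_byte, input_size):
    # Sample a normalized duration (e.g., seconds per input byte) from this
    # job's completed tasks and scale it to the new task's input size.
    return random.choice(completed_durations_per_byte) * input_size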

5.2 DAG of Tasks

Jobs are typically composed as a DAG of tasks, with input tasks (e.g., map or extract) reading data from the underlying storage and intermediate tasks (e.g., reduce or join) aggregating their outputs. Even in DAGs of tasks, the accuracy of the result is dominated by the fraction of completed input tasks. This makes GRASS's functioning straightforward in error-bound jobs—complete as many input tasks as required to meet the error-bound and all intermediate tasks further in the DAG.

For deadline-bound jobs, we use a widely occurring property that intermediate tasks perform similar functions across jobs. Further, they have relatively fewer stragglers. Thus, we estimate the time taken for intermediate tasks by comparing jobs of similar sizes and then subtract it to obtain the deadline for the input tasks.
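Concretely, the deadline handed to the input phase is the job's deadline minus the predicted duration of the downstream phases; a small sketch under that reading (the median-based predictor is our assumption):

def input_phase_deadline(job_deadline, intermediate_durations_of_similar_jobs):
    # Predict the downstream (intermediate) phases' duration from previously
    # seen jobs of similar size (here, their median) and give whatever time
    # remains to the input tasks.
    ds = sorted(intermediate_durations_of_similar_jobs)
    return job_deadline - ds[len(ds) // 2]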

6 Evaluation

We evaluate GRASS on a 200-node EC2 cluster. Our focus is on quantifying the performance improvements compared to current designs, i.e., LATE [2] and Mantri [1], and on understanding how close to the optimal performance GRASS comes. Further, we illustrate the impact of the design decisions such as learning the switching point between RAS and GS. Our main results can be summarized as follows.

1. GRASS increases accuracy of deadline-bound jobs by 47% and speeds up error-bound jobs by 38%. Even non-approximation jobs (i.e., error-bound of zero) speed up by 34%. Further, GRASS nearly matches the optimal performance. (§6.2)

2. GRASS’s learning based approach for determiningwhen to switch from RAS to GS is over 30% betterthan simple strawman techniques. Further, the useof all three factors discussed in §4.1 is crucial forinferring the optimal switching point. (§6.3)

6.1 Methodology

Workload: Our evaluation is based on traces from Facebook's production Hadoop [14] cluster and Microsoft Bing's production Dryad [23] cluster. The traces capture over half a million jobs running across many months (Table 1). The clusters run a mix of interactive and production jobs whose performance has a significant impact on productivity and revenue. To create our experimental workload, we retain the inter-arrival times, input files and number of tasks of jobs. The jobs were, however, not approximation queries and required all their tasks to complete. Hence, we convert the jobs to mimic deadline- and error-bound jobs as follows.

For experiments on error-bound jobs, we pick the error tolerance of the job randomly between 5% and 30%. This is consistent with the experimental setup in recently reported research [4, 24]. Prior work also recommends setting deadlines by calibrating task durations [4, 9]. For the purpose of calibration, we obtain the ideal duration of a job in the trace by substituting the duration of each of its tasks by the median task duration in the job, again, as per recent work on straggler mitigation [3]. We set the deadline to be an additional factor (randomly between 2% and 20%) on top of this ideal duration.

                      Facebook      Microsoft Bing
Dates                 Oct 2012      May–Dec 2011
Framework             Hadoop        Dryad
Script                Hive [25]     Scope [26]
Jobs                  575K          500K
Cluster Size          3,500         Thousands
Straggler mitigation  LATE [2]      Mantri [1]

Table 1: Details of Facebook and Bing traces.

Job Bins: We show our experimental results depending on the size of the jobs (i.e., the number of tasks). We use three distinctions: "small" (< 50 tasks), "medium" (51−500 tasks), and "large" (> 500 tasks). Note that the Bing workload has more large jobs and fewer small jobs than the Facebook workload.

EC2 Deployment: We deploy our Hadoop and Spark prototypes on a 200-node EC2 cluster and evaluate them using the workloads described above. Each experiment is repeated five times and we pick the median. We measure improvement in the average accuracy for deadline-bound jobs and average duration for error-bound jobs.
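The conversion of the traced (exact) jobs into approximation jobs can be summarized as follows; this is our reading of the methodology above, with the randomness drawn uniformly in the stated ranges and the wave-based ideal-duration computation being our simplification:

import random

def make_error_bound():
    # Error tolerance picked uniformly between 5% and 30%.
    return random.uniform(0.05, 0.30)

def make_deadline(median_task_duration, num_tasks, num_slots):
    # Ideal duration: every task is assumed to take the job's median task
    # duration and to run in full waves over the available slots.
    waves = -(-num_tasks // num_slots)   # ceiling division
    ideal = waves * median_task_duration
    # Deadline: the ideal duration inflated by a random factor of 2%-20%.
    return ideal * (1 + random.uniform(0.02, 0.20))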

We also use a trace-driven simulator to evaluate at larger scales and over longer durations.

Baseline: We contrast GRASS with two state-of-the-art speculation algorithms—LATE [2] and Mantri [1].

6.2 Improvements from GRASS

We contrast GRASS’s performance with that ofLATE [2], Mantri [1], and the optimal scheduler.

6.2.1 Deadline-bound jobs

GRASS improves the accuracy of deadline-bound jobs by 34% to 40% in the Hadoop prototype. Gains in both the Facebook and Bing workloads are similar. Figures 5a and 5b split the gains by job size. The gains compared to LATE as baseline are consistently higher than with Mantri. Also, the gains in large jobs are pronounced compared to small and medium sized jobs because their many waves of tasks provide plenty of potential for GRASS.

The Spark prototype improves accuracy by 43% to 47%. The gains are higher because Spark's task sizes are much smaller than Hadoop's due to in-memory inputs. This makes the effect of stragglers more distinct. Again, large jobs gain the most, improving by over 50% (Figures 5c and 5d). The larger gains for multi-waved jobs are encouraging because smaller task sizes in the future [18] will make multi-waved executions the norm going forward. Unlike the Hadoop case, the gains compared to both LATE and Mantri are similar. Both LATE and Mantri have only limited efficacy when the impact of stragglers is high.

Figure 5: Accuracy improvement in deadline-bound jobs with LATE [2] and Mantri [1] as baselines. Panels: (a) Facebook workload–Hadoop, (b) Bing workload–Hadoop, (c) Facebook workload–Spark, (d) Bing workload–Spark; improvement (%) in average accuracy by job bin (#tasks).

Figure 6: GRASS's gains (over LATE) binned by the deadline and error bound. Deadlines are binned by the factor over the ideal job duration (§6.1). Panels: (a) deadline bins, (b) error bins.

Figure 6a dices the improvements by the deadline (specifically, the additional factor over the ideal job duration; see §6.1). Note that gains are nearly uniform across deadline values. This indicates that GRASS not only copes with stringent deadlines but is valuable even when the deadline is lenient.

Gains with simulations are consistent with deployment, indicating not only that GRASS's gains hold over longer durations but also the simulator's robustness.

6.2.2 Error-bound jobs

Similar to deadline-bound jobs, improvements with the Spark prototype (33% to 37%) are higher compared to the Hadoop prototype (24% to 30%). This shows that GRASS works well not only with established frameworks like Hadoop but also upcoming ones like Spark.

Figure 7: Speedup in error-bound jobs with LATE [2] and Mantri [1] as baselines. Panels: (a) Facebook workload–Hadoop, (b) Bing workload–Hadoop, (c) Facebook workload–Spark, (d) Bing workload–Spark; improvement (%) in average job duration by job bin (#tasks).

Note that the gains are indistinguishable among different job bins (Figures 7a and 7b) in the Spark prototype; large jobs gain a touch more in the Hadoop prototype (Figures 7c and 7d). Again, our simulation results are consistent with deployment, and so are omitted.

As Figure 6b shows, GRASS’s gains persist at bothtight as well as moderate error bounds. At high errorbounds, there is smaller scope for GRASS beyond LATE.The gains at tight error bounds is noteworthy becausethese jobs are closer to exact jobs that require all (or mostof) their tasks to complete. In fact, exact jobs speed upby 34%, thus making GRASS valuable even in clustersthat are yet to deploy approximation analytics.

6.2.3 Optimality of GRASS

While the results above show the speedup GRASS provides, the question remains as to whether further improvements are possible. To understand the room available for improvement beyond GRASS, we compare its performance with an optimal scheduler that knows task durations and slot availabilities in advance.

Figure 8 shows the results for the Facebook workload with Spark. It highlights that GRASS's performance matches the optimal for both deadline- as well as error-bound jobs. Thus, GRASS is an efficient near-optimal solution for the NP-hard problem of scheduling tasks for approximation jobs with speculative copies.

Figure 8: GRASS's gains match the optimal scheduler. Panels: (a) deadline-bound jobs (improvement in average accuracy), (b) error-bound jobs (improvement in average job duration), for GRASS vs. Optimal by job bin (#tasks).

Figure 9: GRASS's gains hold across job DAG sizes. Panels: (a) deadline-bound jobs, (b) error-bound jobs; improvement (%) vs. length of DAG for the Facebook and Bing workloads.

6.2.4 DAG of tasks

To complete the evaluation of GRASS we investigate how performance gains depend on the length of the job's DAG. Intuitively, as long as our estimation of intermediate phases is accurate, GRASS's handling of the input phase should remain unchanged, and Figure 9 confirms this for both deadline- and error-bound jobs. Gains from GRASS remain stable with the length of the DAG.

6.3 Evaluating GRASS’s Design DecisionsTo understand the impact of the design decisions made inGRASS, we focus on three questions. First, how impor-tant is it that GRASS switches from RAS to GS? Second,how important is it that this switching is learned adap-tively rather than fixed statically? Third, how sensitiveis GRASS to the perturbation factor ξ? In the interestof space, we present results on these topics for only theFacebook workload using LATE as a baseline; results forthe Bing workload with Mantri are similar.

6.3.1 The value of switching

To understand the importance of switching between RAS and GS we compare GRASS's performance with using only GS and RAS all through the job. Figure 10 performs the comparison for deadline-bound jobs. GRASS's improvements, both on average as well as in individual job bins, are strictly better than GS and RAS. This shows that if using only one of them is the best choice, GRASS automatically avoids switching. Further, GRASS's overall improvement in accuracy is over 20% better than the best of GS or RAS, demonstrating the value of switching as the job nears its deadline. The above trends are consistent with error-bound jobs as well (Figure 11), though GRASS's benefit is slightly lower.

Figure 10: GRASS's switching is 25% better than using GS or RAS all through for deadline-bound jobs. We use the Facebook workload and LATE as baseline. Panels: (a) Hadoop, (b) Spark; improvement (%) in average accuracy by job bin (#tasks) for GS-only, RAS-only and GRASS.

Figure 11: GRASS's switching is 20% better than using GS or RAS all through for error-bound jobs. We use the Facebook workload and LATE as baseline. Panels: (a) Facebook workload–Hadoop, (b) Facebook workload–Spark; improvement (%) in average job duration by job bin (#tasks).

Figure 12: Comparing GRASS's learning based switching approach to a strawman that approximates two waves of tasks. GRASS is 30%−40% better than the strawman. Panels: (a) deadline-bound jobs, (b) error-bound jobs.

The contrast of GS and RAS is also interesting. GS outperforms RAS for small jobs but loses out as job sizes increase. The significant degradation in performance in the unfavorable job bin (medium and large jobs for GS, versus small jobs for RAS) illustrates the pitfalls of statically picking the speculation algorithm.

6.3.2 The value of learning

Given the benefit of switching, the question becomes when this switching should occur. GRASS does this adaptively based on three factors: deadline/error-bound, cluster utilization and estimation accuracy of trem and tnew. Now, we illustrate the benefit of this approach compared to simpler options, i.e., choosing the switching point statically or based on a subset of these three factors. Note that we have already seen that these three factors are enough to be near optimal (Figure 8).

Static switching: First, when considering a static design, a natural "strawman" based on our theoretical analysis is to estimate the point when there are two remaining waves as follows. For deadline-bound jobs, it is the point when the time to the deadline is sufficient for at most two waves of tasks. For error-bound jobs, it is when the number of (unique) scheduled tasks sufficient to satisfy the error-bound makes up two waves. The strawman uses the current wave-width of the job and assumes task durations to be the median of completed tasks.

Figure 12 compares GRASS with the above strawman. Gains with the strawman are 66% and 73% of the gains with GRASS for deadline-bound and error-bound jobs, respectively. Small and medium jobs lag the most as a wrong estimation of the switching point affects a large fraction of their tasks. Thus, the benefit of adaptively determining the switching point is significant.

Adaptive switching: Next, we study the impact of the three factors used to adaptively learn the switching threshold. To do this, Figures 13 and 14 compare the designs using the best one or two factors with GRASS.

When only one factor can be used to switch, picking the deadline/error-bound provides the best results. This is intuitive given the importance of the approximation bound to the ordering of tasks. When two factors are used, in addition to the deadline/error-bound, cluster utilization matters more for the Hadoop prototype while estimation accuracy is more important for the Spark prototype. Tasks of Hadoop jobs are longer and more sensitive to slot allocations, which are dictated by utilization. While the smaller Spark tasks are more fungible, this also makes them sensitive to estimation errors.

Using only one factor is significantly worse than using all three factors. The performance picks up with deadline-bound jobs when two factors are used, but error-bound jobs' gains continue to lag until all three are used. Thus, in the absence of a detailed model for job executions, the three factors act as good predictors.

6.3.3 Sensitivity to Perturbation

The final aspect of GRASS that we evaluate is the perturbation factor, ξ, which decides how often the scheduler does not switch during a job's execution (see §4.2). This perturbation is crucial for GRASS's learning of the optimal switching point. All results shown previously set ξ to 15%, which was picked empirically.

Figure 13: Using all three factors for deadline-bound jobs compared to only one or two is 18%−30% better. Panels: (a) Hadoop, (b) Spark; improvement (%) in average accuracy by job bin (#tasks) for Best-1, Best-2 and GRASS.

Figure 14: Using all three factors for error-bound jobs compared to one or two factors is 15%−25% better. Panels: (a) Hadoop, (b) Spark; improvement (%) in average job duration by job bin (#tasks).

Figure 15 highlights the sensitivity of GRASS to this choice. Low values of ξ hamper learning because of the lack of sufficient samples, while high values incur performance loss resulting from not switching from RAS to GS often enough. Our results show that this exploration–exploitation tradeoff is optimized at ξ = 15%, and that performance drops off sharply around this point. Deadline-bound jobs are more sensitive to a poor choice of ξ than error-bound jobs. Using ξ of 15% is consistent with studies on multi-armed bandit problems [27], which are related to our learning problem.

7 Related Work

The problem of stragglers was identified in the original MapReduce paper [28]. Since then, solutions have been proposed to mitigate them using speculative executions [2, 1, 23]. These solutions, however, are not designed for approximation jobs. Such jobs require prioritizing the right subset of tasks by carefully considering the opportunity cost of speculation. Further, our evaluations show that GRASS speeds up even exact jobs that require all their tasks to complete. Thus, it is a unified solution that cluster schedulers can deploy for both approximation as well as non-approximation computations.

Figure 15: Sensitivity of GRASS's performance to the perturbation factor ξ. Using ξ = 15% is empirically best. (Panels: (a) Deadline-bound Jobs, (b) Error-bound Jobs. Axes: Perturbation (ξ) vs. Improvement (%) in Average Accuracy / Average Job Duration, for the Facebook and Bing workloads.)

Prioritizing tasks of a job is a classic scheduling problem with known heuristics [11, 12]. These heuristics assume accurate knowledge of task durations and hence do not require speculative copies to be scheduled dynamically. Estimating task durations accurately, however, is still an open challenge, as acknowledged by many studies [3, 19]. This makes speculative copies crucial, and we develop a theoretically backed solution to optimally prioritize tasks with speculative copies.

Modeling real-world clusters has been a challenge faced by other schedulers too. Recently reported research has acknowledged the problem of estimating task durations [16], predicting stragglers [3, 19], as well as modeling multi-waved job executions [17]. Their solutions primarily involve sidestepping the problem by not predicting stragglers and instead replicating tasks upfront [3], or approximating the number of waves from file sizes [17]. Such sidestepping, however, is not an option for GRASS, and hence we build tailored approximations.

Finally, replicating tasks in distributed systems has a long history [29, 30, 31], with extensive studies in prior work [32, 33, 34]. These studies assume upfront replication, as opposed to dynamic replication in reaction to stragglers. The latter problem is both harder and unsolved. In this work, we take a stab at this problem and obtain near-optimal results on our production workloads.

8 Concluding Remarks

This paper explores speculative cluster scheduling in the context of approximation jobs. From the analysis of a simple but informative analytic model, we develop a speculation algorithm, GRASS, that uses opportunity cost to determine when to speculate early in the job and then switches to more aggressive speculation as the job nears its approximation bound. Prototype implementations on Hadoop and Spark, deployed on a 200-node EC2 cluster, show that GRASS provides a 47% improvement in accuracy of deadline-bound jobs and a 38% speedup for error-bound jobs, in production workloads from Facebook and Bing. Further, the evaluation highlights that GRASS is a unified speculation solution for both approximation and exact computations, since it also provides a 34% speedup for exact jobs.

A Modeling and Analyzing Speculation

The model focuses on one job that has T tasks⁵ and S slots out of a total capacity normalized to 1. Let the initial job size be x and the remaining amount of work in the job at time t be x(t). We focus our analysis on the rate at which work is completed, which we denote by µ(t; x, S, T), or µ(t) for short. Note that by focusing on the service rate we are ignoring the ordering of tasks and focusing primarily on speculation.

Proactive speculation: We start by considering proactive policies that launch k(x(t)) speculative copies of tasks when the job has remaining size x(t). We propose the following approximation for µ(t) in this case:

\[
\mu(t) \;\approx\; \left[\frac{x(t)}{x} \cdot \frac{T}{S} \cdot k(x(t))\right]_{k(x(t))/S}^{1}
\cdot \frac{E[\tau]}{k(x(t)) \, E[\min(\tau_1, \ldots, \tau_{k(x(t))})]}
\tag{1}
\]

where τ is a random task size.

To understand this approximation, note that the first term approximates the completion rate of work and the second term approximates the "blow-up factor," i.e., the ratio of the expected work completed without duplications to the amount of work done with duplications. To understand the first term, note that (x(t)/x)T is the fractional number of tasks that remain to be completed at time t, and thus there are (x(t)/x)T k(x(t)) tasks available to schedule at time t, including speculative copies. Recalling that the capacity of a slot is 1/S, that the maximum capacity that can be allocated is 1, and that the minimum number of slots used is the number of copies k(x(t)), we obtain the first term in (1); the bracket in (1) denotes clamping the enclosed quantity to lie between k(x(t))/S and 1. The second term computes the "blow-up factor," which is the expected amount of work done per task without speculation (E[τ]) divided by the expected work done per task with speculation (k(x(t)) E[min(τ1, τ2, . . . , τk(x(t)))]), since k(x(t)) copies are created and they are stopped when the first copy completes.
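To make the approximation concrete, here is a minimal numerical sketch (ours, not the paper's code) that evaluates (1) for Pareto-distributed task sizes; the parameter values and the Monte Carlo estimate of E[min(τ1, . . . , τk)] are illustrative assumptions.

import numpy as np

def mu_approx(x_t, x, T, S, k, beta=1.5, x_m=1.0, samples=100_000, seed=0):
    """Numerically evaluate the service-rate approximation (1) for a
    proactive policy running k copies of every remaining task.
    Task sizes are assumed Pareto(x_m, beta); expectations are estimated
    by Monte Carlo. Illustrative sketch only.
    """
    rng = np.random.default_rng(seed)

    # First term: available parallelism (x(t)/x)*(T/S)*k, clamped between
    # k/S (one task and its k copies always run) and 1 (total capacity).
    parallelism = min(max((x_t / x) * (T / S) * k, k / S), 1.0)

    # Second term: "blow-up factor" E[tau] / (k * E[min(tau_1, ..., tau_k)]).
    tau = x_m * (1.0 + rng.pareto(beta, size=(samples, k)))  # Pareto(x_m, beta) draws
    blow_up = tau[:, 0].mean() / (k * tau.min(axis=1).mean())

    return parallelism * blow_up

# Example: a job halfway done (x(t)/x = 0.5) with T = 100 tasks on S = 20 slots.
for k in (1, 2, 3):
    print(k, round(mu_approx(x_t=0.5, x=1.0, T=100, S=20, k=k), 3))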

Given the approximation in (1), the question is: what proactive speculation policy minimizes job duration?

Theorem 1 When task sizes are Pareto(x_m, β), the proactive speculation policy that minimizes the completion time of the job is

\[
k(x(t)) \;=\;
\begin{cases}
\sigma, & \text{if } \frac{x(t)}{x} T \sigma \ge S; \\
S \big/ \big(\frac{x(t)}{x} T\big), & \text{if } S > \frac{x(t)}{x} T \sigma \text{ and } \frac{x(t)}{x} T \ge 1; \\
S, & \text{if } 1 > \frac{x(t)}{x} T,
\end{cases}
\tag{2}
\]

where σ = max(2/β, 1).

⁵For approximation jobs, T should be interpreted as the number of tasks that are completed before the deadline or error limit is reached.

This theorem leads to Guidelines 1 and 2. Specifically, the first line corresponds to the "early waves" and the remaining lines to the "last wave". During the "early waves" the optimal policy may or may not speculate, depending on the task size distribution: speculation happens only when β < 2, which is when task sizes have infinite variance. In contrast, during the "last wave", regardless of the task size distribution, the optimal policy speculates to ensure all slots are used.
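For concreteness, a direct transcription of (2) in code (ours; a scheduler would round the returned value to an integer number of copies):

def optimal_copies(x_t, x, T, S, beta):
    """Optimal number of proactive copies k(x(t)) per Theorem 1, for
    Pareto(x_m, beta) task sizes. x_t/x is the fraction of work left,
    T is the number of tasks and S the number of slots.
    """
    sigma = max(2.0 / beta, 1.0)        # duplicate in early waves only if beta < 2
    remaining_tasks = (x_t / x) * T     # fractional number of unfinished tasks

    if remaining_tasks * sigma >= S:    # early waves: enough work to fill all slots
        return sigma
    if remaining_tasks >= 1:            # last wave: spread the slots over what is left
        return S / remaining_tasks
    return S                            # a single task remains: give it every slot

# Illustrative run: 100-task job on 20 slots with heavy-tailed tasks (beta = 1.2).
for frac_left in (1.0, 0.5, 0.1, 0.005):
    print(frac_left, round(optimal_copies(frac_left, 1.0, 100, 20, beta=1.2), 2))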

Reactive speculation: We now turn to reactive speculation policies, which wait until a task has had ω work completed before launching any copies. Both GS and RAS are examples of such policies and can be translated into choices for ω as described in §3.2.

Our analysis of proactive policies provides important insight into the design of reactive policies. In particular, during early waves the optimal proactive policy runs at most two copies of each task, and so we limit our reactive policies to this level of speculation. Additionally, the previous analysis highlights that during the last wave it is best to speculate aggressively in order to use up the full capacity, and thus it is best to speculate immediately without waiting ω time. This yields the following approximation for µ(t):

\[
\mu(t) \;\approx\;
\begin{cases}
\dfrac{E[\tau_1]}{E[\tau_1 \mid 0 \le \tau_1 < \omega] \Pr(0 \le \tau_1 < \omega) + \big(2 E[Z - \omega \mid \tau_1 \ge \omega] + \omega\big) \Pr(\tau_1 \ge \omega)},
& \text{when } \frac{x(t)}{x} T \big(\Pr(0 \le \tau_1 < \omega) + 2 \Pr(\tau_1 \ge \omega)\big) \ge S; \\[6pt]
\text{optimal proactive speculation (from (1))},
& \text{when } \frac{x(t)}{x} T \big(\Pr(0 \le \tau_1 < \omega) + 2 \Pr(\tau_1 \ge \omega)\big) < S,
\end{cases}
\tag{3}
\]

where τ1, τ2 are random task sizes and Z = min(τ1, τ2 + ω).

Again, the first line in (3) approximates the service rate during the early waves of the job, while the second line approximates the service rate during the final wave of the job. To understand the first line, note that during early waves there are enough tasks to use capacity 1 (in expectation) as long as (x(t)/x) T (Pr(0 ≤ τ1 < ω) + 2 Pr(τ1 ≥ ω)) ≥ S. Thus, all that remains is the "blow-up factor." As before, the numerator is the expected amount of work per task without speculation (E[τ1]) and the denominator is the expected amount of work per task with reactive speculation. This is E[τ1 | τ1 < ω] if the initial copy finishes before ω, and 2E[Z − ω | τ1 ≥ ω] + ω if the initial copy takes longer than ω.

Our design problem can be reduced to finding the ω that minimizes the response time of the job. The complicated form of (3) makes it difficult to understand the optimal ω analytically. Figure 4, therefore, presents a numerical optimization by comparing GS and RAS to other reactive policies. It leads to Guideline 3, which highlights that GS is near optimal if the number of waves in the job is < 2, while RAS is near-optimal if the number of waves in the job is ≥ 2.
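To illustrate the kind of numerical optimization behind Figure 4, the sketch below (ours, under assumed Pareto parameters and an arbitrary grid of thresholds) Monte Carlo estimates the early-wave blow-up factor in (3) for different waiting thresholds ω:

import numpy as np

def reactive_blow_up(omega, beta=1.5, x_m=1.0, samples=200_000, seed=0):
    """Estimate the early-wave blow-up factor in (3): expected work per task
    without speculation divided by expected work with a reactive copy
    launched once the original has run for omega. Larger means a higher
    service rate. Illustrative Monte Carlo sketch.
    """
    rng = np.random.default_rng(seed)
    tau1 = x_m * (1.0 + rng.pareto(beta, samples))   # original copy
    tau2 = x_m * (1.0 + rng.pareto(beta, samples))   # speculative copy
    z = np.minimum(tau1, tau2 + omega)               # Z = min(tau1, tau2 + omega)

    # Work spent per task: tau1 if it finished before omega; otherwise both
    # copies run from omega until Z, i.e. 2*(Z - omega) + omega in total.
    work = np.where(tau1 < omega, tau1, 2.0 * (z - omega) + omega)
    return tau1.mean() / work.mean()

# Scan candidate thresholds omega and compare (grid is arbitrary).
for omega in (0.5, 1.0, 2.0, 4.0, 8.0):
    print(omega, round(reactive_blow_up(omega), 3))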


References

[1] G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, E. Harris, and B. Saha. Reining in the Outliers in Map-Reduce Clusters Using Mantri. In USENIX OSDI, 2010.
[2] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica. Improving MapReduce Performance in Heterogeneous Environments. In USENIX OSDI, 2008.
[3] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica. Effective Straggler Mitigation: Attack of the Clones. In USENIX NSDI, 2013.
[4] S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. In ACM EuroSys, 2013.
[5] T. Condie, N. Conway, P. Alvaro, J. Hellerstein, K. Elmeleegy, and R. Sears. MapReduce Online. In USENIX NSDI, 2010.
[6] Interactive Big Data analysis using approximate answers, 2013. http://tinyurl.com/k5favda.
[7] J. Liu, K. Shih, W. Lin, R. Bettati, and J. Chung. Imprecise Computations. Proceedings of the IEEE, 1994.
[8] S. Lohr. Sampling: Design and Analysis. Thomson, 2009.
[9] J. Hellerstein, P. Haas, and H. Wang. Online Aggregation. In ACM SIGMOD, 1997.
[10] M. Garofalakis and P. Gibbons. Approximate Query Processing: Taming the Terabytes. In VLDB, 2001.
[11] M. Pinedo. Scheduling: Theory, Algorithms, and Systems. Springer, 2012.
[12] L. Kleinrock. Queueing Systems, Volume II: Computer Applications. John Wiley & Sons, New York, 1976.
[13] M. Lin, J. Tan, A. Wierman, and L. Zhang. Joint Optimization of Overlapping Phases in MapReduce. In IFIP Performance, 2013.
[14] Hadoop. http://hadoop.apache.org.
[15] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. Franklin, S. Shenker, and I. Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In USENIX NSDI, 2012.
[16] E. Bortnikov, A. Frank, E. Hillel, and S. Rao. Predicting Execution Bottlenecks in Map-Reduce Clusters. In USENIX HotCloud, 2012.
[17] G. Ananthanarayanan, A. Ghodsi, A. Wang, D. Borthakur, S. Kandula, S. Shenker, and I. Stoica. PACMan: Coordinated Memory Caching for Parallel Jobs. In USENIX NSDI, 2012.
[18] K. Ousterhout, A. Panda, J. Rosen, S. Venkataraman, R. Xin, S. Ratnasamy, S. Shenker, and I. Stoica. The Case for Tiny Tasks in Compute Clusters. In USENIX HotOS, 2013.
[19] J. Dean. Achieving Rapid Response Times in Large Online Services. In Berkeley AMPLab Cloud Seminar, 2012.
[20] S. Resnick. Heavy-Tail Phenomena: Probabilistic and Statistical Modeling. Springer, 2007.
[21] J. C. Gittins. Bandit Processes and Dynamic Allocation Indices. Journal of the Royal Statistical Society, Series B (Methodological), 1979.
[22] I. Sonin. A Generalized Gittins Index for a Markov Chain and Its Recursive Calculation. Statistics & Probability Letters, 2008.
[23] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. In ACM EuroSys, 2007.
[24] W. Baek and T. Chilimbi. Green: A Framework for Supporting Energy-Conscious Programming Using Controlled Approximation. In ACM SIGPLAN Notices, 2010.
[25] Hive. http://wiki.apache.org/hadoop/Hive.
[26] R. Chaiken, B. Jenkins, P. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and Efficient Parallel Processing of Massive Datasets. In VLDB, 2008.
[27] M. Tokic and G. Palm. Value-Difference Based Exploration: Adaptive Control between Epsilon-Greedy and Softmax. In KI 2011: Advances in Artificial Intelligence. Springer, 2011.
[28] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 2008.
[29] A. Baratloo, M. Karaul, Z. Kedem, and P. Wyckoff. Charlotte: Metacomputing on the Web. In 9th Conference on Parallel and Distributed Computing Systems, 1996.


[30] D. Anderson, J. Cobb, and E. Korpela. SETI@home: An Experiment in Public-Resource Computing. Communications of the ACM, 2002.
[31] M. Rinard and P. Diniz. Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers. In ACM PLDI, 1996.
[32] D. Paranhos, W. Cirne, and F. Brasileiro. Trading Cycles for Information: Using Replication to Schedule Bag-of-Tasks Applications on Computational Grids. In Euro-Par, 2003.
[33] G. Ghare and S. Leutenegger. Improving Speedup and Response Times by Replicating Parallel Programs on a SNOW. In JSSPP, 2004.
[34] W. Cirne, D. Paranhos, F. Brasileiro, L. Goes, and W. Voorsluys. On the Efficacy, Efficiency and Emergent Behavior of Task Replication in Large Distributed Systems. In Parallel Computing, 2007.
