Size-Based Scheduling: From Theory To Practice, And Back
Matteo Dell’Amico, EURECOM, 24 April 2014

DESCRIPTION

The proof that the best response time in queuing systems is obtained by scheduling the job with the shortest remaining processing time dates back to 1966; since then, other size-based scheduling protocols that pair near-optimal response times with strong fairness guarantees have been proposed. Yet, despite these very desirable properties, size-based scheduling policies are almost never used in practice: a key reason is that, in real systems, knowing exact job sizes a priori is prohibitive. In this talk, I will first describe our efforts to put concepts from theory into practice by developing HFSP, a size-based scheduler for Hadoop MapReduce that uses estimated rather than exact size information. The results were surprisingly good even with very inaccurate size estimates; this motivated us to return to theory and perform an in-depth study of scheduling based on estimated sizes. The outcome is very promising: for a large class of workloads, size-based scheduling performs well even with very rough size estimates; for the other workloads, simple modifications to the existing scheduling protocols are sufficient to greatly enhance performance.

Page 1: Size-Based Scheduling: From Theory To Practice, And Back

Size-Based Scheduling: From Theory To Practice, And Back

Matteo Dell’Amico

EURECOM, 24 April 2014

Page 2: Size-Based Scheduling: From Theory To Practice, And Back

Credits

Joint work with

Pietro Michiardi, Mario Pastorelli (EURECOM)
Antonio Barbuzzi (ex EURECOM, now at VisualDNA, UK)
Damiano Carra (University of Verona, Italy)

Page 3: Size-Based Scheduling: From Theory To Practice, And Back

Outline

1. Big Data and MapReduce
2. Size-Based Scheduling for MapReduce
3. Size-Based Scheduling With Errors

Page 4: Size-Based Scheduling: From Theory To Practice, And Back

Big Data and MapReduce


Page 5: Size-Based Scheduling: From Theory To Practice, And Back

Big Data: Definition

Data that is too big for you to handle the way you normally do

The 3 (+2) Vs

Volume, Velocity, Variety
… plus Veracity and Value

… But Still …

Why is everybody talking about Big Data now?

Page 8: Size-Based Scheduling: From Theory To Practice, And Back

Big Data: Why Now?

1991: Maxtor 7040A

40 MB
600-700 KB/s
One minute to read it all

Now: Western Digital Caviar

4 TB
128 MB/s
9 hours to read
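The two read times on this slide follow directly from capacity divided by sequential throughput. The short check below is only a sketch: it assumes decimal units and takes 650 KB/s as the midpoint of the quoted 600-700 KB/s range; the function name is illustrative.

```python
def full_read_seconds(capacity_bytes: float, throughput_bytes_per_s: float) -> float:
    """Time needed to read a whole disk sequentially at a constant rate."""
    return capacity_bytes / throughput_bytes_per_s


# Assumption: decimal units (1 KB = 10**3 bytes, and so on).
KB, MB, TB = 10**3, 10**6, 10**12

# Assumption: 650 KB/s is the midpoint of the 600-700 KB/s range on the slide.
maxtor_seconds = full_read_seconds(40 * MB, 650 * KB)
caviar_seconds = full_read_seconds(4 * TB, 128 * MB)

print(f"Maxtor 7040A: about {maxtor_seconds:.0f} s to read it all")
print(f"WD Caviar:    about {caviar_seconds / 3600:.1f} h to read it all")
```

Capacity grew by a factor of 100,000 while throughput grew by roughly 200, which is why reading a full disk went from about a minute to most of a working day.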

Page 10: Size-Based Scheduling: From Theory To Practice, And Back

Moore and His Brothers

Moore’s Law: processing power doubles every 18 months
Kryder’s Law: storage capacity doubles every year
Nielsen’s Law: bandwidth doubles every 21 months

Storage is cheap: we never throw away anything
Processing all that data is expensive
Moving it around is even worse
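To see how quickly the three doubling periods diverge, one can compound them over the same horizon. This is a sketch using only the periods quoted on the slide; the ten-year horizon and the names are arbitrary choices for illustration.

```python
def growth_factor(years: float, doubling_months: float) -> float:
    """Multiplicative growth after `years` if a quantity doubles every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)


# Doubling periods as stated on the slide.
DOUBLING_MONTHS = {
    "processing (Moore)": 18,
    "storage (Kryder)": 12,
    "bandwidth (Nielsen)": 21,
}

YEARS = 10  # assumption: an arbitrary horizon for the comparison
for name, months in DOUBLING_MONTHS.items():
    print(f"{name:>20}: x{growth_factor(YEARS, months):,.0f} in {YEARS} years")
```

Under these rates storage outgrows processing by about an order of magnitude per decade, and bandwidth falls behind both, which matches the ordering of the three bullet points: storing is cheap, processing is expensive, moving is worse.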

Page 12: Size-Based Scheduling: From Theory To Practice, And Back

MapReduce

Bring the computation to the data, which is split in blocks across a cluster

Map

One task per block
  Hadoop filesystem (HDFS): 64 MB blocks by default
Stores key-value pairs locally
  e.g., for word count: [(red, 15), (green, 7), ...]

Reduce

Number of tasks set by the programmer
Mapper output is partitioned by key and pulled from the “mappers”
The Reduce function operates on all values for a single key
  e.g., (green, [7, 42, 13, ...])
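The word-count example from this slide can be mimicked in a few lines of plain Python. This is a single-process sketch of the data flow only, not Hadoop code: `map_block`, `shuffle` and `reduce_key` are hypothetical names, and the two strings stand in for HDFS blocks.

```python
from collections import defaultdict


def map_block(block: str):
    """Map task: one per block, emits (word, count) pairs for that block only."""
    counts = defaultdict(int)
    for word in block.split():
        counts[word] += 1
    return list(counts.items())


def shuffle(mapper_outputs, num_reducers: int):
    """Partition mapper output by key, so each reducer sees every value of its keys."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            partitions[hash(key) % num_reducers][key].append(value)
    return partitions


def reduce_key(key, values):
    """Reduce function: operates on all the values collected for a single key."""
    return key, sum(values)


blocks = ["red green red", "green blue red"]       # stand-ins for two HDFS blocks
mapper_outputs = [map_block(b) for b in blocks]    # e.g. [('red', 2), ('green', 1)]
result = dict(
    reduce_key(key, values)
    for partition in shuffle(mapper_outputs, num_reducers=2)
    for key, values in partition.items()
)
print(result)
```

The number of reducers is a free parameter here, as on the slide; which reducer handles which key depends on the hash function, but every key always ends up on exactly one reducer.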

Page 14: Size-Based Scheduling: From Theory To Practice, And Back

The Problem With Scheduling

Current Workloads

Huge job size variance
  Running time: seconds to hours
  I/O: KBs to TBs
[Chen et al., VLDB ’12; Ren et al., VLDB ’13; Appuswamy et al., SOCC ’13]

Consequence

Interactive jobs are delayed by long ones
In smaller clusters long queues exacerbate the problem

Page 15: Size-Based Scheduling: From Theory To Practice, And Back

Size-Based Scheduling for MapReduce


Page 16: Size-Based Scheduling: From Theory To Practice, And Back

Shortest Remaining Processing Time

[Figure: cluster usage (%) over time (s) for job 1, job 2 and job 3, in two panels. One panel has time axis marks at 10, 15, 37.5, 42.5 and 50 s; the other has marks at 10, 20, 30 and 50 s.]

Page 18: Size-Based Scheduling: From Theory To Practice, And Back

Size-Based Scheduling

Shortest Remaining Processing Time (SRPT)

Minimizes average sojourn time (the time between job submission and completion)

Fair Sojourn Protocol (FSP)

Jobs are scheduled in the order they would complete under Processor Sharing (PS)
Avoids starving large jobs
Fairness: jobs are guaranteed to complete no later than they would under Processor Sharing [Friedman & Henderson, SIGMETRICS ’03]

Unknown Job Size

… and what if we can only estimate job size?
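The difference between Processor Sharing and SRPT is easy to reproduce on a single server. The simulator below is a minimal sketch written for this purpose, not the authors' simulator; the example workload and all names are made up for illustration.

```python
def simulate(jobs, policy):
    """Single-server simulation. jobs: list of (arrival_time, size).
    policy maps {job_id: remaining_size} to {job_id: share of the server}.
    Returns {job_id: sojourn_time}."""
    order = sorted(range(len(jobs)), key=lambda i: jobs[i][0])
    remaining, sojourn = {}, {}
    now, k = 0.0, 0
    while k < len(order) or remaining:
        if not remaining:                        # idle server: jump to the next arrival
            now = max(now, jobs[order[k]][0])
        while k < len(order) and jobs[order[k]][0] <= now:
            remaining[order[k]] = float(jobs[order[k]][1])
            k += 1
        shares = policy(remaining)
        # Time until the first served job would complete at the current shares.
        dt = min(remaining[i] / s for i, s in shares.items() if s > 0)
        next_now = now + dt
        if k < len(order) and jobs[order[k]][0] < next_now:
            next_now = jobs[order[k]][0]         # stop early at the next arrival
            dt = next_now - now
        for i, s in shares.items():
            remaining[i] -= s * dt
        now = next_now
        for i in [i for i, r in remaining.items() if r <= 1e-9]:
            sojourn[i] = now - jobs[i][0]
            del remaining[i]
    return sojourn


def processor_sharing(remaining):
    """PS: every pending job gets an equal share of the server."""
    return {i: 1.0 / len(remaining) for i in remaining}


def srpt(remaining):
    """SRPT: the job with the least remaining work gets the whole server."""
    return {min(remaining, key=remaining.get): 1.0}


def mean_sojourn(sojourn):
    return sum(sojourn.values()) / len(sojourn)


jobs = [(0, 30), (10, 10), (15, 10)]             # hypothetical (arrival, size) pairs
print("PS  :", mean_sojourn(simulate(jobs, processor_sharing)))
print("SRPT:", mean_sojourn(simulate(jobs, srpt)))
```

Both policies finish all the work at the same instant; they differ only in how the waiting is distributed, and SRPT concentrates it on the largest job.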

Page 19: Size-Based Scheduling: From Theory To Practice, And Back

Multi-Processor Size-Based Scheduling

[Figure: cluster usage (%) over time (s) for job 1, job 2 and job 3 on a multi-processor cluster, in two panels. One panel has time axis marks at 10, 13, 23.5, 24.5 and 39 s; the other has marks at 10, 13, 20, 23 and 39 s.]

Page 20: Size-Based Scheduling: From Theory To Practice, And Back

HFSP In A Nutshell

Job Size Estimation

Naive estimation at first
After the first s “training” tasks have run, we update it
  s = 5 by default
On t task slots, we give priority to training tasks
  t avoids starving “old” jobs
  “shortcut” for very small jobs

Scheduling Policy

We treat Map and Reduce phases as separate jobs
Virtual time: per-job simulated completion time
When a task slot frees up, we schedule a task from the job that completes earliest in the virtual time

Page 22: Size-Based Scheduling: From Theory To Practice, And Back

Job Size Estimation

Initial Estimation

k · l
  k: number of tasks
  l: average size of past Map/Reduce tasks

Second Estimation

After the s samples have run, compute l′ as the average size of the sample tasks
  timeout (60 s by default): if tasks are not completed by then, use progress %
Predicted job size: k · l′
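The two estimation steps can be written down in a few lines. This is a sketch under stated assumptions, not HFSP's actual code: the function names are hypothetical, and the way an unfinished sample task is extrapolated from its progress (elapsed time divided by progress fraction) is an assumed reading of "use progress %".

```python
from statistics import mean


def initial_estimate(num_tasks, past_task_sizes):
    """First guess k * l: number of tasks times the average size of past tasks."""
    return num_tasks * mean(past_task_sizes)


def refined_estimate(num_tasks, samples):
    """Second guess k * l', where l' is the average size of the sample tasks.

    samples: list of (elapsed_seconds, progress) with progress in (0, 1].
    A finished task has progress 1.0, so its size is its elapsed time.
    Assumption: a task still running at the timeout is extrapolated linearly,
    i.e. its size is taken to be elapsed / progress.
    """
    return num_tasks * mean(elapsed / progress for elapsed, progress in samples)


# Hypothetical job with k = 100 tasks.
print(initial_estimate(100, past_task_sizes=[10, 20]))
# Two samples finished after 8 s and 12 s; one was only 50% done at the 60 s timeout.
print(refined_estimate(100, samples=[(8, 1.0), (12, 1.0), (60, 0.5)]))
```

A single slow sample task shifts the refined estimate considerably, which gives some intuition for why the measured estimation errors later in the talk are far from negligible.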

Page 23: Size-Based Scheduling: From Theory To Practice, And Back

Virtual Time

Estimated job size is in a “serialized” single-machine format
Simulates a processor-sharing cluster to compute completion time, based on
  number of tasks per job
  available task slots in the real cluster
Simulation is updated when
  new jobs arrive
  tasks complete
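One possible way to compute such virtual completion times is sketched below. It is an illustration under assumptions, not the HFSP implementation: slots are split max-min fairly among jobs, a job never receives more slots than it has tasks, the task count of a job is kept constant for simplicity, and all names are hypothetical.

```python
def ps_allocation(demands, slots):
    """Max-min fair split of `slots`; a job cannot use more slots than it has tasks."""
    allocation, left = {}, float(slots)
    todo = sorted(demands, key=demands.get)      # smallest demand first
    while todo:
        fair = left / len(todo)
        job = todo.pop(0)
        allocation[job] = min(demands[job], fair)
        left -= allocation[job]
    return allocation


def virtual_completion_times(sizes, tasks, slots):
    """sizes: job -> estimated serialized size (slot-seconds); tasks: job -> number of tasks.
    Returns job -> completion time in the simulated processor-sharing cluster."""
    remaining = {job: float(size) for job, size in sizes.items()}
    now, finish = 0.0, {}
    while remaining:
        alloc = ps_allocation({job: tasks[job] for job in remaining}, slots)
        dt = min(remaining[job] / alloc[job] for job in remaining)
        now += dt
        for job in list(remaining):
            remaining[job] -= alloc[job] * dt
            if remaining[job] <= 1e-9:
                finish[job] = now
                del remaining[job]
    return finish


# Hypothetical jobs on a 12-slot cluster.
sizes = {"a": 100, "b": 20, "c": 300}
tasks = {"a": 10, "b": 2, "c": 30}
finish = virtual_completion_times(sizes, tasks, slots=12)
print(sorted(finish, key=finish.get))            # order in which to serve the jobs
```

When a task slot frees up, the scheduler would pick a task from the first job in this order that still has tasks to run; the simulation would be recomputed whenever a job arrives or a task completes.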

Page 24: Size-Based Scheduling: From Theory To Practice, And Back

Experimental Setup

Platform

36 machines with 4 CPUs, 16 GB RAM

Workloads

Generated with the PigMix benchmark: realistic operations on synthetic data
Data sizes inspired by known measurements [Chen et al., VLDB ’12; Ren et al., VLDB ’13]

Configuration

We compare to Hadoop’s FAIR scheduler
  similar to processor-sharing
Delay scheduling enabled both for FAIR and HFSP

Page 25: Size-Based Scheduling: From Theory To Practice, And Back

Sojourn Time

[Figure: ECDF of sojourn time (s, log scale) for HFSP and FAIR, in two panels. Left: “small” workload, ~16% better. Right: “large” workload, ~75% better.]

Sojourn time: the time that passes between the moment a job is submitted and the moment it terminates
With higher load, the scheduler becomes decisive
Analogous results on a different platform and a different workload

Page 26: Size-Based Scheduling: From Theory To Practice, And Back

Job Size Estimation

[Figure: ECDF of the estimation error (log scale, 0.25 to 4) for the MAP and REDUCE phases.]

Error: real size / estimated size
Fits a log-normal distribution
The estimation isn’t even that good! Why does HFSP work that well?

Page 27: Size-Based Scheduling: From Theory To Practice, And Back

Size-Based Scheduling With Errors


Page 28: Size-Based Scheduling: From Theory To Practice, And Back

Scheduling Simulation

How does size-based scheduling behave in the presence of errors?
  Lu et al. (MASCOTS 2004) suggest much worse results
We wrote a simulator to understand better, with Hadoop-like workloads [Chen et al., VLDB ’12]
  written in Python, efficient and easy to prototype new schedulers

Page 29: Size-Based Scheduling: From Theory To Practice, And Back

Log-Normal Error Distribution

[Figure: PDF of the log-normal error distribution for sigma = 0.125, 0.25, 1 and 4, with x from 0 to 3.]

Error: real size / estimated size
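Drawing size estimates with log-normal errors takes one call to the standard library. The sketch below assumes the underlying normal distribution has mean 0, so that over- and under-estimation by the same factor are equally likely; the function name is illustrative.

```python
import random


def estimated_size(real_size, sigma, rng):
    """Return an estimate whose error (real size / estimated size) is log-normal.

    Assumption: the underlying normal has mean 0, so the median error is 1.
    """
    error = rng.lognormvariate(0.0, sigma)
    return real_size / error


rng = random.Random(1)
for sigma in (0.125, 1.0, 4.0):
    estimates = sorted(estimated_size(100.0, sigma, rng) for _ in range(10001))
    low, median, high = estimates[500], estimates[5000], estimates[9500]
    print(f"sigma={sigma}: 5% {low:.3g}, median {median:.3g}, 95% {high:.3g}")
```

With sigma = 0.125 the estimates stay close to the real size of 100, while with sigma = 4 they span several orders of magnitude in both directions.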

Page 30: Size-Based Scheduling: From Theory To Practice, And Back

Weibull Job Size Distribution

[Figure: PDF of the Weibull job size distribution for shape = 0.125, 1, 2 and 4, with x from 0 to 3.]

Interpolates between
  heavy-tailed job size distributions (shape < 1)
  exponential distributions (shape = 1)
  bell-shaped distributions (shape > 1)
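The effect of the shape parameter is easiest to appreciate by sampling. The sketch below is an illustration only: it rescales every distribution to mean 1 (an assumption, made so that the shapes are comparable) and reports how much of the total work sits in the largest 1% of jobs.

```python
import math
import random


def weibull_job_sizes(n, shape, rng, mean_size=1.0):
    """Sample n job sizes from a Weibull distribution with the given shape.

    Assumption: the scale is chosen so that the mean equals mean_size,
    using mean = scale * Gamma(1 + 1/shape).
    """
    scale = mean_size / math.gamma(1.0 + 1.0 / shape)
    return [rng.weibullvariate(scale, shape) for _ in range(n)]


def top_share(sizes, fraction=0.01):
    """Fraction of the total work contained in the largest `fraction` of jobs."""
    ordered = sorted(sizes, reverse=True)
    top = ordered[: max(1, int(len(ordered) * fraction))]
    return sum(top) / sum(ordered)


rng = random.Random(42)
for shape in (0.25, 1.0, 4.0):
    sizes = weibull_job_sizes(20000, shape, rng)
    print(f"shape={shape}: largest 1% of jobs hold {top_share(sizes):.0%} of the work")
```

For small shape values a handful of huge jobs dominates the workload, which is the regime where a single badly estimated job can hurt many small ones.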

Page 31: Size-Based Scheduling: From Theory To Practice, And Back

Size-Based Scheduling With Errors

[Figure: two heat maps of MST / MST(PS) as a function of shape (0.125 to 4) and sigma (0.125 to 4), with a color scale from 0.25 to 128. Left: SRPT. Right: FSP.]

Problems for heavy-tailed job size distributions
Otherwise, size-based scheduling works very well

Page 32: Size-Based Scheduling: From Theory To Practice, And Back

Over-Estimations and Under-Estimations

[Figure: remaining size over time t, in two groups of panels. Over-estimation: jobs J1, J2, J3. Under-estimation: jobs J4, J5, J6.]

Under-estimations can wreak havoc with heavy-tailed workloads

Page 33: Size-Based Scheduling: From Theory To Practice, And Back

FSP + PS

Idea

Without errors, real jobs always complete before virtual ones
When they don’t (they are late), there has been an estimation error
The scheduler can realize this, and take corrective action

Realization

To keep late jobs from blocking the system, just do processor sharing among them instead of scheduling only the “most late” one
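The effect of sharing the server among late jobs can be reproduced with a small single-server simulation. This is a sketch written for illustration, not the authors' simulator: job sizes are estimated once at arrival, the virtual system is plain processor sharing on the estimates, and the example workload (one heavily under-estimated large job followed by two small ones) is made up.

```python
def simulate_fsp(jobs, share_late=True):
    """jobs: list of (arrival, real_size, estimated_size); one server.
    share_late=True does processor sharing among late jobs (FSP + PS);
    share_late=False serves only the "most late" job.
    Returns {job_id: sojourn_time}."""
    order = sorted(range(len(jobs)), key=lambda i: jobs[i][0])
    real, virt = {}, {}            # remaining real work / remaining virtual (estimated) work
    late_order, sojourn = [], {}   # late_order: jobs in order of virtual completion
    now, k, eps = 0.0, 0, 1e-9
    while k < len(order) or real:
        next_arrival = jobs[order[k]][0] if k < len(order) else float("inf")
        late = [i for i in late_order if i in real]
        if late:
            served = late if share_late else late[:1]
        elif real:
            served = [min(real, key=lambda i: virt[i])]   # first to finish under virtual PS
        else:
            served = []
        # Advance to the next event: arrival, real completion or virtual completion.
        dt = next_arrival - now
        if served:
            dt = min(dt, min(real[i] for i in served) * len(served))
        if virt:
            dt = min(dt, min(virt.values()) * len(virt))
        now += dt
        for i in served:
            real[i] -= dt / len(served)
        for i in [i for i in real if real[i] <= eps]:
            sojourn[i] = now - jobs[i][0]
            del real[i]
        virtual_share = dt / len(virt) if virt else 0.0
        for i in list(virt):
            virt[i] -= virtual_share
            if virt[i] <= eps:
                del virt[i]
                if i in real:      # done in virtual time but not for real: the job is late
                    late_order.append(i)
        while k < len(order) and jobs[order[k]][0] <= now + eps:
            i = order[k]
            real[i], virt[i] = float(jobs[i][1]), float(jobs[i][2])
            k += 1
    return sojourn


# Hypothetical workload: job 0 has real size 100 but is estimated at 1.
jobs = [(0, 100, 1), (1, 1, 1), (2, 1, 1)]
for share_late in (False, True):
    result = simulate_fsp(jobs, share_late)
    print("FSP + PS" if share_late else "FSP     ", sum(result.values()) / len(result))
```

In this toy case the under-estimated job delays the small jobs until it is done when only the most late job is served, whereas with sharing the small jobs become late shortly after arriving and then leave the system quickly, at a small cost for the large job.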

Page 35: Size-Based Scheduling: From Theory To Practice, And Back

FSP + PS: Results

[Figure: two heat maps of MST / MST(PS) as a function of shape (0.125 to 4) and sigma (0.125 to 4), with a color scale from 0.25 to 128. Left: FSP. Right: FSP + PS.]

Page 36: Size-Based Scheduling: From Theory To Practice, And Back

Take-Home Messages

Size-based scheduling on Hadoop is viable, and particularly appealing for companies with (semi-)interactive jobs and smaller clusters

Schedulers like HFSP (in practice) and FSP+PS (in theory) are robust with respect to errors
  therefore, simple rough estimations are sufficient

HFSP is available as free software at http://github.com/bigfootproject/hfsp
Scheduling simulator at https://bitbucket.org/bigfootproject/schedsim
HFSP: published at IEEE BIGDATA 2013
Scheduling simulator and FSP+PS: under submission, available at http://arxiv.org/abs/1403.5996

Page 37: Size-Based Scheduling: From Theory To Practice, And Back

Bonus Content: Comparison with SRPT

Schedulers vs. SRPT

[Figure: MST / MST(SRPT) as a function of shape (0.125 to 4) for SRPTE, FSPE, FSPE+PS, PS, LAS and FIFO.]

Page 38: Size-Based Scheduling: From Theory To Practice, And Back

Bonus Content: Real Workloads

Facebook

[Figure: MST / MST(SRPT) as a function of sigma (0.125 to 4) for SRPTE, FSPE, FSPE+PS, PS and LAS, in two panels. Left: synthetic workload (shape = 0.25). Right: Facebook Hadoop cluster.]

Page 39: Size-Based Scheduling: From Theory To Practice, And Back

Bonus Content: Real Workloads

Web Cache

[Figure: MST / MST(SRPT) (log scale) as a function of sigma (0.125 to 4) for SRPTE, FSPE, FSPE+PS, PS, LAS and FIFO, in two panels. Left: synthetic workload (shape = 0.177). Right: IRCache web cache.]

Page 40: Size-Based Scheduling: From Theory To Practice, And Back

Bonus Content: Job Preemption

Job Preemption

Supported in Hadoop

Kill running tasks
  wastes work
Wait for them to finish
  may take long

Our Choice

Map tasks: Wait
  generally small
For Reduce tasks, we implemented Suspend and Resume
  avoids the drawbacks of both Wait and Kill

Page 42: Size-Based Scheduling: From Theory To Practice, And Back

Bonus Content: Job Preemption

Job Preemption: Suspend and Resume

Our Solution

We delegate to the OS: SIGSTOP and SIGCONT

The OS will swap tasks if and when memory is needed
  no risk of thrashing: swapped data is loaded only when resuming

Configurable maximum number of suspended tasks
  if reached, switch to Wait
  hard limit on memory allocated to suspended tasks

Among preemptable running tasks, suspend the youngest
  likely to finish later
  may have smaller memory footprint
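On a POSIX system, suspending and resuming a task process comes down to two signals. The sketch below illustrates the policy described on this slide with Python's standard `os` and `signal` modules; it is not HFSP's Hadoop code, and the function names and the bookkeeping structures (`running`, `suspended`) are hypothetical.

```python
import os
import signal


def preempt_youngest(running, suspended, max_suspended):
    """Suspend the youngest running task, or return None to fall back to Wait.

    running: dict pid -> start time; suspended: dict pid -> start time.
    Falls back to Wait when the configured limit of suspended tasks is reached.
    """
    if not running or len(suspended) >= max_suspended:
        return None
    pid = max(running, key=running.get)      # youngest task = latest start time
    os.kill(pid, signal.SIGSTOP)             # the OS stops the process; memory may be swapped out
    suspended[pid] = running.pop(pid)
    return pid


def resume(pid, running, suspended):
    """Resume a previously suspended task."""
    os.kill(pid, signal.SIGCONT)             # the process continues where it left off
    running[pid] = suspended.pop(pid)
```

A caller would invoke `preempt_youngest` when a slot is needed for a higher-priority job and `resume` once a slot frees up again. SIGSTOP cannot be caught or ignored by the target process, so no cooperation from the task is required.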
