Description

The proof that the best response time in queuing systems is obtained by scheduling the jobs with the shortest remaining processing time dates back to 1966; since then, other size-based scheduling policies that pair near-optimal response times with strong fairness guarantees have been proposed. Yet, despite these very desirable properties, size-based scheduling policies are almost never used in practice: a key reason is that, in real systems, it is prohibitive to know exact job sizes a priori. In this talk, I will first describe our efforts to put into practice concepts coming from theory, developing HFSP: a size-based scheduler for Hadoop MapReduce that uses estimations rather than exact size information. The results were surprisingly good even with very inaccurate size estimations: this motivated us to return to theory and perform an in-depth study of scheduling based on estimated sizes. The results are very promising: for a large class of workloads, size-based scheduling performs well even with very rough size estimations; for the other workloads, simple modifications to the existing scheduling policies are sufficient to greatly enhance performance.
Size-Based Scheduling: From Theory to Practice, and Back

Matteo Dell'Amico
EURECOM, 24 April 2014
Credits

Joint work with:
Pietro Michiardi, Mario Pastorelli (EURECOM)
Antonio Barbuzzi (ex EURECOM, now @VisualDNA, UK)
Damiano Carra (University of Verona, Italy)
Outline

1. Big Data and MapReduce
2. Size-Based Scheduling for MapReduce
3. Size-Based Scheduling With Errors
Big Data and MapReduce
Big Data and MapReduce / Big Data

Big Data: Definition

Data that is too big for you to handle the way you normally do.

The 3 (+2) Vs:
Volume, Velocity, Variety
… plus Veracity and Value

But still: why is everybody talking about Big Data now?
Big Data and MapReduce Big Data
Big Data: Why Now?
.1991: Maxtor 7040A..
......
40 MB
600-700 KB/s
One minute to read it all
.Now: Western Digital Caviar..
......
4 TB
128 MB/s
9 hours to read
6
Big Data and MapReduce / Big Data

Moore and His Brothers

Moore's Law: processing power doubles every 18 months
Kryder's Law: storage capacity doubles every year
Nielsen's Law: bandwidth doubles every 21 months

Storage is cheap: we never throw away anything
Processing all that data is expensive
Moving it around is even worse
Big Data and MapReduce / MapReduce

MapReduce

Bring the computation to the data, which is split in blocks across a cluster.

Map:
One task per block
Hadoop filesystem (HDFS): 64 MB blocks by default
Stores key-value pairs locally
e.g., for word count: [(red, 15), (green, 7), …]

Reduce:
Number of tasks set by the programmer
Mapper output is partitioned by key and pulled from the "mappers"
The Reduce function operates on all values for a single key
e.g., (green, [7, 42, 13, …])
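The word-count example above can be sketched end to end in a few lines of Python (a toy illustration of the model, not Hadoop's actual API; all function names here are made up):

```python
# Toy word count following the Map / shuffle / Reduce structure above.
from collections import defaultdict

def map_fn(block):
    """One Map task per block: emit (word, count) pairs,
    like the slide's [(red, 15), (green, 7), ...] example."""
    counts = defaultdict(int)
    for word in block.split():
        counts[word] += 1
    return list(counts.items())

def shuffle(mapper_outputs):
    """Partition Map output by key, so each Reduce call sees
    all values for a single key, e.g. (green, [7, 42, 13, ...])."""
    grouped = defaultdict(list)
    for pairs in mapper_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_fn(key, values):
    return key, sum(values)

blocks = ["red green red", "green red blue"]   # two input blocks
grouped = shuffle(map_fn(b) for b in blocks)
result = dict(reduce_fn(k, v) for k, v in grouped.items())
# result: {"red": 3, "green": 2, "blue": 1}
```

In Hadoop the shuffle step is performed by the framework itself, which partitions by key and lets reducers pull their partitions from the mappers.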
Big Data and MapReduce / MapReduce

The Problem With Scheduling

Current workloads:
Huge job size variance
Running time: seconds to hours; I/O: KBs to TBs
[Chen et al., VLDB '12; Ren et al., VLDB '13; Appuswamy et al., SOCC '13]

Consequence:
Interactive jobs are delayed by long ones
In smaller clusters, long queues exacerbate the problem
Size-Based Scheduling for MapReduce
Size-Based Scheduling for MapReduce / Size-Based Scheduling

Shortest Remaining Processing Time

[Figure: cluster usage (%) over time (s) for jobs 1-3, comparing two schedules]
Size-Based Scheduling for MapReduce / Size-Based Scheduling

Size-Based Scheduling

Shortest Remaining Processing Time (SRPT):
Minimizes average sojourn time (the time between job submission and completion)

Fair Sojourn Protocol (FSP):
Jobs are scheduled in the order they would complete if doing Processor Sharing (PS)
Avoids starving large jobs
Fairness: jobs are guaranteed to complete no later than under Processor Sharing [Friedman & Henderson, SIGMETRICS '03]

Unknown job size:
… and what if we can only estimate job sizes?
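A toy single-server comparison may help fix intuitions (my own illustration, not from the talk): when all jobs are present at time 0, SRPT reduces to shortest-job-first, and its mean sojourn time can be compared with Processor Sharing directly.

```python
# Toy comparison of mean sojourn times on one server, all arrivals at t=0.
def sjf_mean_sojourn(sizes):
    """Run jobs to completion, shortest first (SRPT with simultaneous arrivals)."""
    t, total = 0.0, 0.0
    for s in sorted(sizes):
        t += s          # completion time of this job
        total += t
    return total / len(sizes)

def ps_mean_sojourn(sizes):
    """Processor Sharing: all unfinished jobs get equal service rates."""
    sizes = sorted(sizes)
    n = len(sizes)
    t, prev, total = 0.0, 0.0, 0.0
    for done, s in enumerate(sizes):
        t += (n - done) * (s - prev)   # n - done jobs share the server
        total += t
        prev = s
    return total / n

sizes = [1.0, 2.0, 10.0]
# SRPT/SJF completions: 1, 3, 13 -> mean 17/3; PS completions: 3, 5, 13 -> mean 7
```

FSP serves jobs in their PS completion order (on this example, the same order as SJF), so it matches SRPT's schedule here while still guaranteeing that no job finishes later than under PS.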
Size-Based Scheduling for MapReduce / Size-Based Scheduling

Multi-Processor Size-Based Scheduling

[Figure: cluster usage (%) over time (s) for jobs 1-3 on a multi-processor cluster, comparing two schedules]
Size-Based Scheduling for MapReduce / HFSP Implementation

HFSP In A Nutshell

Job size estimation:
Naive estimation at first
After the first s "training" tasks have run, we update it (s = 5 by default)
We give priority to training tasks on up to t task slots; bounding t avoids starving "old" jobs
A "shortcut" serves very small jobs

Scheduling policy:
We treat Map and Reduce phases as separate jobs
Virtual time: per-job simulated completion time
When a task slot frees up, we schedule one from the job that completes earliest in the virtual time
Size-Based Scheduling for MapReduce / HFSP Implementation

Job Size Estimation

Initial estimation: k · l
k: number of tasks
l: average size of past Map/Reduce tasks

Second estimation:
After the s sample tasks have run, compute l′ as the average size of the sample tasks
Timeout (60 s by default): if sample tasks are not completed by then, use their progress %
Predicted job size: k · l′
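The two estimates can be sketched as follows (a hypothetical rendering of the slide's formulas; in particular, extrapolating a timed-out sample as timeout / progress is my assumption about how the progress % is used):

```python
# Sketch of HFSP's two-stage job size estimate, as described above.
def initial_estimate(num_tasks, avg_past_task_size):
    """Initial estimate k * l: task count times the average size
    of previously observed Map/Reduce tasks."""
    return num_tasks * avg_past_task_size

def refined_estimate(num_tasks, sample_durations, progress, timeout=60.0):
    """Second estimate k * l', where l' averages the s sample tasks.
    A sample still running at the timeout contributes an extrapolated
    size: timeout / progress fraction (e.g. 60 s at 50% -> 120 s)."""
    sizes = []
    for duration, frac in zip(sample_durations, progress):
        if duration is None:          # not completed by the timeout
            sizes.append(timeout / frac)
        else:
            sizes.append(duration)
    return num_tasks * (sum(sizes) / len(sizes))

# 4 tasks; samples ran 10 s, 20 s, and one timed out at 50% progress
estimate = refined_estimate(4, [10.0, 20.0, None], [1.0, 1.0, 0.5])
# estimate: 4 * (10 + 20 + 120) / 3 = 200.0
```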
Size-Based Scheduling for MapReduce / HFSP Implementation

Virtual Time

Estimated job size is in a "serialized" single-machine format
Simulates a processor-sharing cluster to compute completion times, based on:
the number of tasks per job
the available task slots in the real cluster
The simulation is updated when:
new jobs arrive
tasks complete
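A minimal sketch of such a virtual-time computation, assuming slots are split evenly among active jobs (a simplification: the real simulator also accounts for each job's task count when assigning slots):

```python
# Simulate a processor-sharing cluster to get virtual completion times.
def virtual_completion_times(remaining, slots):
    """remaining: job -> estimated remaining work, in slot-seconds.
    Returns each job's completion time in the simulated PS cluster."""
    remaining = dict(remaining)
    now, finish = 0.0, {}
    while remaining:
        share = slots / len(remaining)           # even split of slots
        job = min(remaining, key=remaining.get)  # next job to finish
        dt = remaining[job] / share
        now += dt
        for j in remaining:                      # every job progresses
            remaining[j] -= share * dt
        finish[job] = now
        del remaining[job]
    return finish

times = virtual_completion_times({"a": 10.0, "b": 20.0}, slots=2)
# times: {"a": 10.0, "b": 15.0}
```

Jobs are then served in increasing order of this virtual completion time; re-running the simulation on arrivals and task completions keeps the ordering up to date.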
Size-Based Scheduling for MapReduce / Experiments

Experimental Setup

Platform: 36 machines with 4 CPUs and 16 GB RAM each

Workloads:
Generated with the PigMix benchmark: realistic operations on synthetic data
Data sizes inspired by known measurements [Chen et al., VLDB '12; Ren et al., VLDB '13]

Configuration:
We compare to Hadoop's FAIR scheduler (similar to processor sharing)
Delay scheduling enabled for both FAIR and HFSP
Size-Based Scheduling for MapReduce / Experiments

Sojourn Time

[Figure: ECDFs of sojourn time (s) for HFSP vs. FAIR; "small" workload: ~16% better, "large" workload: ~75% better]

Sojourn time: the time that passes between the moment a job is submitted and the moment it terminates
With higher load, the choice of scheduler becomes decisive
Analogous results on a different platform and a different workload
Size-Based Scheduling for MapReduce / Experiments

Job Size Estimation

[Figure: ECDF of the estimation error (real size / estimated size) for Map and Reduce tasks, spanning roughly 0.25 to 4]

The error fits a log-normal distribution
The estimation isn't even that good! Why does HFSP work that well?
Size-Based Scheduling With Errors
Size-Based Scheduling With Errors / Scheduling Simulation

Scheduling Simulation

How does size-based scheduling behave in the presence of errors?
Lu et al. (MASCOTS 2004) suggest much worse results
We wrote a simulator to understand better, with Hadoop-like workloads [Chen et al., VLDB '12]
Written in Python: efficient, and easy to prototype new schedulers
Size-Based Scheduling With Errors / Scheduling Simulation

Log-Normal Error Distribution

[Figure: log-normal density of the error (real size / estimated size) for sigma = 0.125, 0.25, 1, 4]
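The error model can be reproduced with the standard library (the concrete numbers and the unit median are my illustration, not the talk's):

```python
# Multiplicative log-normal error model: log-errors are symmetric
# around zero, and sigma controls how inaccurate the estimator is.
import random

def estimated_size(real_size, sigma, rng):
    """A hypothetical estimator: the real size times a log-normal factor."""
    return real_size * rng.lognormvariate(0.0, sigma)

rng = random.Random(42)
errors = [estimated_size(100.0, 1.0, rng) / 100.0 for _ in range(10_000)]
median = sorted(errors)[len(errors) // 2]
# the median multiplicative error sits near 1: unbiased in log-space
```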
Size-Based Scheduling With Errors / Scheduling Simulation

Weibull Job Size Distribution

[Figure: Weibull densities for shape = 0.125, 1, 2, 4]

The shape parameter interpolates between:
heavy-tailed job size distributions (shape < 1)
exponential distributions (shape = 1)
bell-shaped distributions (shape > 1)
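Sampling job sizes with Python's `random.weibullvariate(scale, shape)` shows the heavy-tail effect the shape parameter controls (toy numbers, not the talk's workloads):

```python
# Compare a heavy-tailed (shape < 1) and an exponential (shape = 1)
# Weibull workload by how much of the total work the largest job carries.
import random

def job_sizes(n, shape, scale=1.0, seed=0):
    rng = random.Random(seed)
    return [rng.weibullvariate(scale, shape) for _ in range(n)]

heavy = job_sizes(100_000, 0.25)   # heavy-tailed: shape < 1
expo = job_sizes(100_000, 1.0)     # exponential: shape = 1
# under a heavy tail, the single largest job accounts for a far bigger
# fraction of the total work than under the exponential
heavy_frac = max(heavy) / sum(heavy)
expo_frac = max(expo) / sum(expo)
```

This concentration of work in a few huge jobs is exactly what makes scheduling errors dangerous in the heavy-tailed regime.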
Size-Based Scheduling With Errors / Scheduling Simulation

Size-Based Scheduling With Errors

[Figure: heatmaps of MST/MST(PS), the mean sojourn time normalized to Processor Sharing, for SRPT and FSP over shape and sigma in 0.125-4]

Problems arise for heavy-tailed job size distributions
Otherwise, size-based scheduling works very well
Size-Based Scheduling With Errors / Scheduling Simulation

Over-Estimations and Under-Estimations

[Figure: remaining size over time for jobs J1-J6, comparing an over-estimation with an under-estimation]

Under-estimations can wreak havoc with heavy-tailed workloads: a large job whose size is under-estimated is scheduled as if it were small, and then occupies the system far longer than expected
Size-Based Scheduling With Errors / Scheduling Simulation

FSP + PS

Idea:
Without errors, real jobs always complete before their virtual copies
When they don't (they are "late"), there has been an estimation error
The scheduler can detect this, and take corrective action

Realization:
To avoid late jobs blocking the system, do processor sharing between them instead of scheduling only the "most late" one
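The rule can be sketched under assumed bookkeeping (job ids mapped to their completion times in the virtual, simulated-PS schedule; the data structures are hypothetical):

```python
# Sketch of the FSP+PS allocation decision described above.
def fsp_ps_allocation(running, virtual_finish, now, slots):
    """A job is 'late' if its virtual copy already finished but the
    real job did not, which can only happen after an estimation error.
    Late jobs processor-share the cluster; otherwise plain FSP serves
    the job with the earliest virtual completion time."""
    late = [j for j in running if virtual_finish[j] <= now]
    if late:
        return {j: slots / len(late) for j in late}
    head = min(running, key=lambda j: virtual_finish[j])
    return {head: slots}

vf = {"a": 5.0, "b": 8.0, "c": 20.0}
shared = fsp_ps_allocation(["a", "b", "c"], vf, now=10.0, slots=2)
# shared: {"a": 1.0, "b": 1.0} -- a and b are late, so they split the slots
```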
Size-Based Scheduling With Errors / Scheduling Simulation

FSP + PS: Results

[Figure: heatmaps of MST/MST(PS) for FSP and FSP + PS over shape and sigma in 0.125-4]
Size-Based Scheduling With Errors / Scheduling Simulation

Take-Home Messages

Size-based scheduling on Hadoop is viable, and particularly appealing for companies with (semi-)interactive jobs and smaller clusters
Schedulers like HFSP (in practice) and FSP+PS (in theory) are robust with respect to errors; therefore, simple rough estimations are sufficient

HFSP is available as free software at http://github.com/bigfootproject/hfsp
Scheduling simulator at https://bitbucket.org/bigfootproject/schedsim
HFSP: published at IEEE BIGDATA 2013
Scheduling simulator and FSP+PS: under submission, available at http://arxiv.org/abs/1403.5996
Bonus Content / Comparison with SRPT

Schedulers vs. SRPT

[Figure: MST/MST(SRPT) vs. shape in 0.125-4 for SRPTE, FSPE, FSPE+PS, PS, LAS, FIFO]
Bonus Content / Real Workloads

[Figure: MST/MST(SRPT) vs. sigma in 0.125-4 for SRPTE, FSPE, FSPE+PS, PS, LAS; left: synthetic workload (shape = 0.25), right: Facebook Hadoop cluster]
Bonus Content / Real Workloads

Web Cache

[Figure: MST/MST(SRPT) vs. sigma in 0.125-4 for SRPTE, FSPE, FSPE+PS, PS, LAS, FIFO; left: synthetic workload (shape = 0.177), right: IRCache web cache]
Bonus Content / Job Preemption

Job Preemption

Supported in Hadoop:
Kill running tasks (wastes work)
Wait for them to finish (may take long)

Our choice:
Map tasks: Wait, since they are generally small
For Reduce tasks, we implemented Suspend and Resume, which avoids the drawbacks of both Wait and Kill
Bonus Content / Job Preemption

Job Preemption: Suspend and Resume

Our solution: delegate to the OS, via SIGSTOP and SIGCONT

The OS will swap tasks out if and when memory is needed
No risk of thrashing: swapped data is loaded only when resuming

Configurable maximum number of suspended tasks
If the maximum is reached, switch to Wait
This puts a hard limit on the memory allocated to suspended tasks

Between preemptable running tasks, suspend the youngest
It is likely to finish later, and may have a smaller memory footprint
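The underlying OS mechanism can be demonstrated directly (a POSIX-only sketch; the `sleep` child merely stands in for a running reduce task):

```python
# Suspend and resume a child process with SIGSTOP / SIGCONT.
import os
import signal
import subprocess
import time

task = subprocess.Popen(["sleep", "30"])       # stand-in for a reduce task

os.kill(task.pid, signal.SIGSTOP)              # suspend: no longer scheduled
_, status = os.waitpid(task.pid, os.WUNTRACED) # wait until the stop is reported
stopped = os.WIFSTOPPED(status)                # the task is stopped, not dead

os.kill(task.pid, signal.SIGCONT)              # resume exactly where it left off
time.sleep(0.1)
alive = task.poll() is None                    # still running after resume

task.terminate()
task.wait()
```

Because the suspended task's memory stays in its address space, the OS pages it out only under memory pressure and pages it back in on resume, which is why there is no thrashing risk.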