Jockey: Guaranteed Job Latency in Data Parallel Clusters
Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric Boutin, and Rodrigo Fonseca

Jockey - EuroSys presentation: cs.brown.edu/~adf/work/EuroSys2012-talk.pdf


DATA PARALLEL CLUSTERS

Predictability

Deadline

VARIABLE LATENCY

[Figure: CDF (0 to 1) of job latency [minutes], x-axis 0 to 40; annotation: 4.3x]

Why does latency vary?

1.  Pipeline complexity
2.  Noisy execution environment

Cosmos

MICROSOFT’S DATA PARALLEL CLUSTERS

•  CosmosStore
•  Dryad
•  SCOPE

DRYAD’S DAG WORKFLOW

[Figure labels: Cosmos Cluster, Pipeline, Job, Deadline, Stage, Tasks]

EXPRESSING PERFORMANCE TARGETS

Priorities? Not expressive enough

Weights? Difficult for users to set

Utility curves? Capture deadline & penalty
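
As an illustration of how a utility curve can capture both a deadline and a penalty, here is a minimal sketch. The slides do not give the curve's shape; the flat-then-linear form, the function name `utility`, and the parameters `full_value` and `penalty_per_min` are all assumptions made for this example.

```python
def utility(completion_min, deadline_min, full_value=1.0, penalty_per_min=0.5):
    """Hypothetical utility curve (shape assumed, not taken from the slides).

    A job that finishes by its deadline earns the full value; each minute
    past the deadline subtracts a fixed penalty.
    """
    if completion_min <= deadline_min:
        return full_value
    late_by = completion_min - deadline_min
    return full_value - penalty_per_min * late_by


if __name__ == "__main__":
    for finish in (40, 50, 52, 60):
        print(finish, utility(finish, deadline_min=50))
```

Unlike a single priority or weight, a curve of this kind states when the result is needed and how much lateness costs, which is the information a scheduler needs to trade one job's resources against another's.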


OUR GOAL


Maximize utility while minimizing resources

by dynamically adjusting the allocation

Jockey

•  Large clusters
•  Many users
•  Prior execution

JOCKEY – MODEL

f(job state, allocation) -> remaining run time


f(progress, allocation) -> remaining run time


JOCKEY – PROGRESS INDICATOR

Per-stage terms, summed over Stage 1 + Stage 2 + Stage 3:

(# complete / total tasks) × (total running + total queuing)
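
The per-stage sum on this slide can be sketched in a few lines. This is an illustrative reading, not the authors' code: the `Stage` fields, the assumption that running and queuing totals come from a prior execution of the job, and the normalization to a 0 to 100 scale in `progress_percent` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    completed_tasks: int      # "# complete"
    total_tasks: int          # "total tasks" (assumed > 0)
    total_running_min: float  # "total running", assumed known from a prior run
    total_queuing_min: float  # "total queuing", assumed known from a prior run


def progress(stages):
    """Sum over stages of (fraction complete) x (total running + total queuing)."""
    return sum(
        (s.completed_tasks / s.total_tasks)
        * (s.total_running_min + s.total_queuing_min)
        for s in stages
    )


def progress_percent(stages):
    """Assumed normalization: progress as a percentage of the job's total work."""
    total = sum(s.total_running_min + s.total_queuing_min for s in stages)
    return 100.0 * progress(stages) / total


if __name__ == "__main__":
    job = [
        Stage(10, 10, 30.0, 10.0),  # Stage 1: finished
        Stage(5, 10, 40.0, 20.0),   # Stage 2: half done
        Stage(0, 4, 80.0, 20.0),    # Stage 3: not started
    ]
    print(progress(job), progress_percent(job))
```

Weighting each stage by its running plus queuing time means a long stage moves the indicator more than a short one, so the value tracks work done rather than the count of finished tasks.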

[Figure: job progress (0 to 100) and estimated job completion [min] (0 to 100), both plotted against time [min] (0 to 60)]

JOCKEY – CONTROL LOOP

Job model:

Progress       10 nodes     20 nodes     30 nodes
1% complete    60 minutes   40 minutes   25 minutes
2% complete    59 minutes   39 minutes   24 minutes
3% complete    58 minutes   37 minutes   22 minutes
4% complete    56 minutes   36 minutes   21 minutes
5% complete    54 minutes   34 minutes   20 minutes

Deadline: 50 min.

Completion: 1%

Deadline: 40 min.

Completion: 3%

Deadline: 30 min.

Completion: 5%
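
The three deadline and completion pairs above can be worked through against the job model table with a small lookup. This is a sketch under an assumed selection rule: the slides do not spell the rule out, so choosing the smallest allocation whose predicted remaining time fits the deadline is inferred from the stated goal of minimizing resources. The names `JOB_MODEL` and `smallest_allocation`, and the fallback to the largest allocation, are hypothetical.

```python
# Predicted remaining run time in minutes, copied from the job model table:
# JOB_MODEL[percent complete][nodes] -> minutes
JOB_MODEL = {
    1: {10: 60, 20: 40, 30: 25},
    2: {10: 59, 20: 39, 30: 24},
    3: {10: 58, 20: 37, 30: 22},
    4: {10: 56, 20: 36, 30: 21},
    5: {10: 54, 20: 34, 30: 20},
}


def smallest_allocation(progress_pct, deadline_min, model=JOB_MODEL):
    """Return the smallest node count predicted to finish within the deadline.

    Assumed rule, not stated on the slides. If no allocation in the table is
    fast enough, fall back to the largest one (also an assumption).
    """
    predictions = model[progress_pct]
    for nodes in sorted(predictions):
        if predictions[nodes] <= deadline_min:
            return nodes
    return max(predictions)


if __name__ == "__main__":
    for pct, deadline in ((1, 50), (3, 40), (5, 30)):
        print(pct, deadline, smallest_allocation(pct, deadline))
```

Under this rule the three cases select 20, 20, and 30 nodes: 10 nodes is never fast enough, and at 5% complete with 30 minutes left, 20 nodes (34 minutes) no longer suffices.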


JOCKEY – MODEL

f(progress, allocation) -> remaining run time

analytic model? machine learning?

simulator

JOCKEY

Problem                Solution
Pipeline complexity    Use a simulator
Noisy environment      Dynamic control

Jockey in Action

•  Real job
•  Production cluster
•  CPU load: ~80%

Initial deadline: 140 minutes


New deadline: 70 minutes


Release resources due to excess pessimism

“Oracle” allocation: total allocation-hours / deadline

Available parallelism less than allocation

Allocation above oracle
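
The oracle allocation, and how far a job's actual allocation sits above it, can be sketched as follows. The oracle formula (total allocation-hours divided by the deadline) is from the slide. The way the excess is normalized in `fraction_above_oracle` is an assumption, since the slides give only the axis label "fraction of allocation above oracle"; both function names are hypothetical.

```python
def oracle_allocation(total_allocation_hours, deadline_hours):
    """Constant allocation that would spend exactly the job's total
    allocation-hours by the deadline (formula from the slide)."""
    return total_allocation_hours / deadline_hours


def fraction_above_oracle(allocations, oracle):
    """Share of the job's total allocation that lies above the oracle level.

    `allocations` holds one sample per equal-length time interval.
    Dividing by the total allocation is an assumed normalization.
    """
    total = sum(allocations)
    if total == 0:
        return 0.0
    above = sum(max(0.0, a - oracle) for a in allocations)
    return above / total


if __name__ == "__main__":
    oracle = oracle_allocation(total_allocation_hours=100.0, deadline_hours=2.0)
    samples = [80.0, 60.0, 40.0, 20.0]  # made-up allocation over four intervals
    print(oracle, fraction_above_oracle(samples, oracle))
```

A policy that holds the oracle level throughout scores zero on this measure, while one that front-loads resources scores higher even if its total usage is the same.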

Evaluation

•  Production cluster
•  21 jobs
•  SLO met?
•  Cluster impact?

10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 110% 120% 130%

job completion time relative to deadline

deadline

Jobs which met the SLO

100  

Page 101: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric

Evaluation

0%

20%

40%

60%

80%

100%

10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 110% 120% 130%

CD

F

job completion time relative to deadline

deadline

Jobs which met the SLO

101  

Page 102: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric

Evaluation

0%

20%

40%

60%

80%

100%

10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 110% 120% 130%

CD

F

job completion time relative to deadline

Jockey

deadline

Jobs which met the SLO

102  

Missed 1 of 94 deadlines

Page 103: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric

Evaluation

0%

20%

40%

60%

80%

100%

10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 110% 120% 130%

CD

F

job completion time relative to deadline

Jockey

deadline

Jobs which met the SLO

103  

Page 104: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric

Evaluation

0%

20%

40%

60%

80%

100%

10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 110% 120% 130%

CD

F

job completion time relative to deadline

Jockey

deadline

Jobs which met the SLO

104  

Page 105: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric

Evaluation

0%

20%

40%

60%

80%

100%

10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 110% 120% 130%

CD

F

job completion time relative to deadline

Jockey

deadline

Jobs which met the SLO

105  

1.4x

Page 106: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric

Evaluation

0%

20%

40%

60%

80%

100%

10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 110% 120% 130%

CD

F

job completion time relative to deadline

max allocation Jockey

deadline

Jobs which met the SLO

106  

Allocated too many resources

Missed 1 of 94 deadlines

Page 107: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric

Evaluation

0%

20%

40%

60%

80%

100%

10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 110% 120% 130%

CD

F

job completion time relative to deadline

max allocation Jockey

Allocation fromsimulator

deadline

Jobs which met the SLO Allocated too many resources

107  

Simulator made good predictions: 80% finish before deadline

Missed 1 of 94 deadlines

Page 108: Jockey - EuroSys presentationcs.brown.edu/~adf/work/EuroSys2012-talk.pdfJockey Guaranteed Job Latency in Data Parallel Clusters Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric

Evaluation

0%

20%

40%

60%

80%

100%

10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 110% 120% 130%

CD

F

job completion time relative to deadline

max allocation Jockey

Allocation fromsimulator

Control loop only

deadline

Jobs which met the SLO Allocated too many resources

Simulator made good predictions: 80% finish before deadline

108  

Control loop is stable

and successful

Missed 1 of 94 deadlines

Evaluation

[Figure: fraction of deadlines missed (y-axis, 0% to 20%) versus fraction of allocation above oracle (x-axis, 0% to 100%). Series: Jockey, max allocation, Allocation from simulator, Control loop only.]

114

Conclusion 115  

119  

Data parallel jobs are complex, yet users demand deadlines.

Jobs run in shared, noisy clusters, making simple models inaccurate.

Jockey

[Diagram, built up over slides 120–124. Labels in order of appearance: Jockey, simulator, control-loop, then five labels reading "Deadline".]

124

Questions?

Andrew Ferguson [email protected]

Co-authors

•  Peter Bodík (Microsoft Research)
•  Srikanth Kandula (Microsoft Research)
•  Eric Boutín (Microsoft)
•  Rodrigo Fonseca (Brown)

126

Backup Slides

127  

Utility Curves

[Figure: utility curve with the deadline marked; the axis labels are not recoverable from the extraction.]

For single jobs, scale doesn’t matter

For multiple jobs, use financial penalties

128
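A utility curve of this kind can be sketched as a piecewise-linear function of completion time. The sketch below is illustrative only: the function names `utility` and `best_option`, the breakpoint, and every numeric value are assumptions, not values from the talk. It also shows why scale doesn't matter for a single job: multiplying the utility by a positive constant does not change which outcome is preferred.

```python
def utility(completion_min, deadline_min=60.0, value=1.0, penalty_per_min=0.1):
    """Hypothetical utility: flat until the deadline, then falls linearly."""
    if completion_min <= deadline_min:
        return value
    return value - penalty_per_min * (completion_min - deadline_min)


def best_option(options, scale=1.0):
    """Pick the completion time with the highest (scaled) utility."""
    return max(options, key=lambda t: scale * utility(t))


options = [50.0, 70.0, 90.0]
# The preferred outcome is the same for any positive scale.
assert best_option(options, scale=1.0) == best_option(options, scale=1000.0)
```

With several jobs competing for the same resources, the relative heights of the curves do matter, which is where a common unit such as a financial penalty comes in.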

129  

Jockey

Resource allocation control loop

1. Slack

2. Hysteresis

3. Dead Zone

[Diagram labels: Prediction, Run Time, Utility]
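The three techniques named above can be sketched as follows. This is a minimal illustration under assumptions, not the talk's implementation: slack is modeled as a multiplier on the predicted remaining time, hysteresis as exponential smoothing of the allocation, and the dead zone as a band of schedule error within which the allocation is left unchanged. All names and constants (`SLACK`, `ALPHA`, `DEAD_ZONE_S`, `adjust_allocation`) are hypothetical.

```python
SLACK = 1.2          # assumed: inflate predictions by 20% to be conservative
ALPHA = 0.2          # assumed: smoothing factor for hysteresis (0 < ALPHA <= 1)
DEAD_ZONE_S = 180.0  # assumed: ignore schedule errors smaller than 3 minutes


def adjust_allocation(prev_alloc, raw_alloc, predicted_remaining_s, time_left_s):
    """Return the next allocation given the previous one and a raw target.

    prev_alloc: allocation currently in effect (tokens)
    raw_alloc: allocation the controller would pick with no damping
    predicted_remaining_s: predicted remaining run time at prev_alloc
    time_left_s: time remaining until the deadline
    """
    # Slack: treat the job as if it will take longer than predicted.
    padded = SLACK * predicted_remaining_s
    # Dead zone: if the padded prediction is close to the time left, change nothing.
    if abs(padded - time_left_s) < DEAD_ZONE_S:
        return prev_alloc
    # Hysteresis: move only part of the way toward the raw target.
    return prev_alloc + ALPHA * (raw_alloc - prev_alloc)
```

The damping trades responsiveness for stability: a smaller `ALPHA` or a wider dead zone means fewer allocation changes, at the cost of reacting more slowly when the job really is falling behind.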

130

Cosmos

Resource sharing in Cosmos

•  Resources are allocated with a form of fair sharing across business groups and their jobs. (Like Hadoop FairScheduler or CapacityScheduler)

•  Each job is guaranteed a number of tokens as dictated by cluster policy; each running or initializing task uses one token. Token released on task completion.

•  A token is a guaranteed share of CPU and memory

•  To increase efficiency, unused tokens are re-allocated to jobs with available work
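The token scheme described on this slide can be sketched as a small allocator. This is an illustration under assumptions, not Cosmos's scheduler: each job first receives up to its guaranteed tokens (bounded by its demand), and unused tokens are then handed out one at a time, round-robin, to jobs that still have runnable tasks. The function name `allocate_tokens` and the round-robin policy are hypothetical.

```python
def allocate_tokens(guaranteed, demand):
    """Distribute tokens: guarantees first, then spare tokens to busy jobs.

    guaranteed: dict job -> tokens guaranteed by cluster policy
    demand: dict job -> number of tasks that could run right now
    Returns dict job -> tokens granted (one token per running task).
    """
    # Each job uses at most its guarantee and at most its demand.
    granted = {j: min(guaranteed[j], demand.get(j, 0)) for j in guaranteed}
    spare = sum(guaranteed.values()) - sum(granted.values())
    # Hand spare tokens, one at a time, to jobs with remaining work.
    while spare > 0:
        hungry = [j for j in sorted(granted) if demand.get(j, 0) > granted[j]]
        if not hungry:
            break
        for j in hungry:
            if spare == 0:
                break
            granted[j] += 1
            spare -= 1
    return granted
```

For example, with guarantees of 10 tokens each for jobs `a` and `b`, and demands of 4 and 30 tasks, job `a` keeps 4 tokens and its 6 unused tokens go to job `b`.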

131

Jockey

Progress indicator

•  Can use many features of the job to build a progress indicator

•  Earlier work (ParaTimer) concentrated on fraction of tasks completed

•  Our indicator is very simple, but we found it performs best for Jockey’s needs

[Formula on slide, labels only: Total vertex initialization time; Total vertex run time; Fraction of completed vertices]
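The slide labels three quantities: total vertex initialization time, total vertex run time, and fraction of completed vertices. The exact formula is not recoverable from this transcript, so the sketch below is an assumption about how they could combine: each stage's fraction of completed vertices is weighted by that stage's total initialization plus run time (assumed to come from a previous run of the job), and the sum is normalized to [0, 1]. `Stage` and `progress` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    init_time_s: float  # total vertex initialization time (assumed from a prior run)
    run_time_s: float   # total vertex run time (assumed from a prior run)
    completed: int      # vertices finished so far
    total: int          # vertices in the stage


def progress(stages):
    """Weighted fraction of completed vertices, normalized to [0, 1]."""
    weights = [s.init_time_s + s.run_time_s for s in stages]
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    # Stages with no vertices contribute nothing to the numerator.
    done = sum(w * (s.completed / s.total) for w, s in zip(weights, stages) if s.total)
    return done / total_weight
```

Weighting by time rather than counting tasks means that finishing half of a long stage moves the indicator more than finishing half of a short one.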

132

Comparison with ARIA

•  ARIA uses analytic models

•  Designed for 3 stages: Map, Shuffle, Reduce

•  Jockey’s control loop is robust due to control-theory improvements

•  ARIA tested on small (66-node) cluster without a network bottleneck

•  We believe Jockey is a better match for production DAG frameworks such as Hive, Pig, etc.

133  

Jockey

Latency prediction: C(p, a)

•  Event-based simulator

–  Same scheduling logic as actual Job Manager

–  Captures important features of job progress

–  Does not model input size variation or speculative re-execution of stragglers

–  Inputs: job algebra, distributions of task timings, probabilities of failures, allocation

•  Analytic model

–  Inspired by Amdahl’s Law: T = S + P/N

–  S is remaining work on critical path, P is all remaining work, N is number of machines
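The analytic model above is simple enough to write down directly. The sketch below evaluates T = S + P/N and, as an illustrative extension not taken from the slide, inverts it to find the smallest N that fits a time budget. The function names are hypothetical, and the units are whatever S and P are measured in.

```python
import math


def predicted_remaining_time(s, p, n):
    """T = S + P/N: critical-path work plus all remaining work spread over N machines."""
    if n <= 0:
        raise ValueError("n must be positive")
    return s + p / n


def min_machines(s, p, budget):
    """Smallest integer N with S + P/N <= budget, or None if no N suffices."""
    # The critical path alone cannot be shortened by adding machines.
    if budget < s or (budget == s and p > 0):
        return None
    if p == 0:
        return 1
    return math.ceil(p / (budget - s))
```

For example, with S = 100, P = 1000 and a budget of 150, the model asks for 20 machines, since 100 + 1000/20 = 150. No number of machines helps once the budget drops to S or below.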

134  

Jockey

Resource allocation control loop

•  Executes in Dryad’s Job Manager

•  Inputs: fraction of completed tasks in each stage, time job has spent running, utility function, precomputed values (for speedup)

•  Output: Number of tokens to allocate

•  Improved with techniques from control-theory
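The inputs and output listed above suggest a decision step of the following shape. This is a sketch under assumptions, not the Job Manager's code: `predict_remaining(progress, tokens)` stands in for a lookup into the precomputed C(p, a) values, `utility(completion_time)` stands in for the job's utility function, and the policy (smallest allocation that maximizes predicted utility) is one plausible choice. All names and numbers are hypothetical.

```python
def choose_allocation(progress, elapsed_s, utility, predict_remaining, candidates):
    """Return the smallest token count that maximizes predicted utility.

    progress: job progress in [0, 1]
    elapsed_s: time the job has spent running
    utility: function mapping completion time (s) to utility
    predict_remaining: function (progress, tokens) -> predicted remaining time (s)
    candidates: iterable of token counts to consider
    """
    best_tokens, best_utility = None, float("-inf")
    for tokens in sorted(candidates):
        completion = elapsed_s + predict_remaining(progress, tokens)
        u = utility(completion)
        # Strict '>' keeps the smallest allocation among ties.
        if u > best_utility:
            best_tokens, best_utility = tokens, u
    return best_tokens


# Example with made-up numbers: remaining time shrinks as tokens grow.
predict = lambda p, a: (1.0 - p) * 6000.0 / a
util = lambda t: 1.0 if t <= 3600 else 0.0
print(choose_allocation(0.5, 3000, util, predict, [1, 2, 5, 10, 20]))  # prints 5
```

Preferring the smallest allocation among equally good ones leaves more tokens for other jobs, which matches the talk's concern with not allocating too many resources.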

Jockey

[Diagram, built up over slides 135–140. Two phases: offline and during job runtime. Components and labels: job profile, simulator, job stats, latency predictions, utility function, resource allocation control loop, running job.]

140