Olivier Beaumont, Lionel Eyraud-Dubois and Juan Angel Lorenzo del Castillo October 24, 2014

Analyzing Real Cluster Data for Formulating Allocation Algorithms in Cloud Platforms

Inria Bordeaux - Sud-Ouest

Need for Models in Computer Science

Lack of real usage data from Cloud infrastructures (privacy, lock-in, scale...)

• Google recently (Nov. 2011) released a trace of usage data from one of its (huge) clusters.
• A few other traces are available, but none of them is as detailed.

Olivier Beaumont et al. - Analyzing Cluster Data for Allocation Algorithms in Clouds. October 24, 2014 - 2

The Problem

Resource allocation algorithms for Cloud Computing, or: how to allocate a set of services or virtual machines (VMs) on a set of physical machines (PMs)

• Objective: optimize resource usage, maximize QoS, ensure SLAs, limit the number of migrations...
• Diverse approaches: online/offline Bin-Packing algorithms (First Fit, Best Fit). The underlying problem is notoriously difficult.
• Most relevant aspects? Dynamicity, fault tolerance, multidimensional resources, additional user-supplied constraints, ...

No consensus on the algorithmic models
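As a point of reference for the bin-packing approaches named above, here is a minimal sketch of online First Fit for a single resource dimension. The function name, the machine capacity, and the demand values are illustrative assumptions and are not taken from the trace.

```python
def first_fit(demands, capacity):
    """Place each demand on the first machine that still has room for it.

    Returns a list of machines, each being the list of demands placed on it.
    """
    machines = []  # demands placed on each open machine
    free = []      # remaining capacity of each open machine
    for d in demands:
        if d > capacity:
            raise ValueError(f"demand {d} exceeds machine capacity {capacity}")
        for i, room in enumerate(free):
            if d <= room:
                machines[i].append(d)
                free[i] -= d
                break
        else:
            # No open machine can host this demand: open a new one.
            machines.append([d])
            free.append(capacity - d)
    return machines


# Hypothetical VM demands, in arbitrary integer units.
placement = first_fit([5, 7, 3, 2, 4], capacity=10)
print(placement)
```

Best Fit differs only in the choice of machine: instead of the first machine with enough room, it picks the one that would be left with the least free capacity.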


Objective

1. Find new characteristics of the trace and exhibit the main properties of its jobs.
2. Propose a set of very few parameters that account for the main characteristics of the trace.

Ultimately, this work aims at:
• Supporting the design of efficient allocation algorithms
• Fostering the generation of realistic random traces



The Google Cluster Trace

186 GB of data

Detailed workload
• ∼700000 jobs
• Each job consists of multiple tasks running in LXCs (Linux Containers)
• 12583 heterogeneous machines
• Each task is assigned to a single physical machine

Exhaustive information
• Actual CPU and memory usage per task, execution time, job priorities...
• Collected during 29 days on 5-minute monitoring intervals

Priority groups
• Low-priority tasks can be evicted/migrated in favor of higher-priority ones


Important Questions

Designing efficient trace models raises several questions:

Profile | Model premise | Questions
Static | Small set of parameters | Set of jobs representative of the whole trace usage?
Dynamics | Statistical prediction. Re-computation frequency. | Variation of jobs over time? Lifespan distribution? Variation in usage patterns?
Advanced Features | Multi-dimensional Bin-Packing problems are simplified if dimensions are correlated. | Correlation between jobs' dimensions (CPU, memory)?
Fault Tolerance | Quantification of qualities and limits of a model | Frequency of failures? Correlated or independent?


Dominant Jobs

Can we find a set of jobs representative of the whole trace?

• 3.8% of all jobs account for 94.7% of CPU and 90% of memory usage.
• Large number of tasks per job (470 on average).
• Most usage is concentrated in Normal Production jobs.
• Similar number of tasks in Normal Production (27%) and Gratis (21%) jobs.

Figure: Average resource usage of dominant jobs, stacked by priority class.

The rest of this work focuses on this set of Dominant Jobs.
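The slides do not spell out how the dominant set was selected. One plausible reading, sketched below under that assumption, is to sort jobs by usage and keep the largest ones until a target share of total usage is covered. The job identifiers and usage values are hypothetical.

```python
def dominant_jobs(usage, target_share):
    """Pick the largest jobs until they cover target_share of total usage.

    usage maps a job id to its aggregated resource usage.
    Returns (selected job ids, share of total usage actually covered).
    """
    total = sum(usage.values())
    selected, covered = [], 0.0
    for job, u in sorted(usage.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= target_share * total:
            break
        selected.append(job)
        covered += u
    return selected, covered / total


# Hypothetical aggregated CPU usage per job (arbitrary units).
usage = {"a": 50, "b": 30, "c": 10, "d": 5, "e": 3, "f": 2}
jobs, share = dominant_jobs(usage, 0.9)
print(jobs, share)
```

With two resources (CPU and memory), the same selection could be run on each dimension and the resulting sets merged. That is again an assumption about the method, not something the slides state.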



Workload Characterization

How is the resource usage of dominant jobs distributed?
• Can it be easily modeled?

Are the resource usage dimensions correlated?
• Fewer parameters reduce the complexity of packing algorithms

Are Dominant Jobs stable over time? Do they exhibit any patterns?
• To estimate their stability with respect to their priority
• So that the resource usage at a given time can be predicted


Workload Characterization

How is the resource usage of jobs distributed?

• It can be modeled by a mixture of two lognormal distributions.
• Most of the usage comes from Normal Production jobs. Some jobs in Gratis and Other occasionally use more resources.

Figure: Distribution of CPU usage. Density of the log of CPU usage, one panel per priority class (Gratis, NormProduction, Other), comparing the data with bimodal and Gaussian fits.

Figure: Distribution of memory usage. Density of the log of memory usage, one panel per priority class (Gratis, NormProduction, Other), comparing the data with bimodal and Gaussian fits.
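To make the two-lognormal mixture concrete, the sketch below evaluates its density and draws random samples from it. The mixture weight and the (mu, sigma) pairs are placeholder values chosen for illustration; the fitted parameters are not given in the slides.

```python
import math
import random


def lognormal_pdf(x, mu, sigma):
    """Density of a variable whose logarithm is Normal(mu, sigma^2)."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2 * math.pi))


def mixture_pdf(x, w, mu1, sigma1, mu2, sigma2):
    """Density of a two-component lognormal mixture with weight w on component 1."""
    return w * lognormal_pdf(x, mu1, sigma1) + (1 - w) * lognormal_pdf(x, mu2, sigma2)


def sample_mixture(n, w, mu1, sigma1, mu2, sigma2, seed=0):
    """Draw n usage values: pick a component with probability w, then sample it."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        if rng.random() < w:
            samples.append(rng.lognormvariate(mu1, sigma1))
        else:
            samples.append(rng.lognormvariate(mu2, sigma2))
    return samples


# Placeholder parameters (NOT fitted to the trace).
PARAMS = dict(w=0.6, mu1=0.0, sigma1=0.5, mu2=3.0, sigma2=1.0)
samples = sample_mixture(1000, **PARAMS)
print(min(samples), max(samples), mixture_pdf(1.0, **PARAMS))
```

A generator of this kind is one way to produce the realistic random traces that this work aims to enable, once the parameters have been fitted per priority class.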


Workload Characterization

For a given job, does memory usage of tasks depend on its CPU usage?

1. Dominant Jobs sampled on 20 random timestamps per day
2. For each job, tasks clustered into groups according to their CPU-memory usage
3. Linear regression on each cluster of tasks
4. Well-fitted jobs classified into Flat (memory constant) or Slope (memory affine in CPU)

Shares by class, in the figure's order: 67%, 6.5%, and 26.5% (Badly Fitted).

Figure: CPU (x-axis) vs Memory (y-axis) of four jobs with different usage patterns. Each dot represents a task.

Conclusion: Memory usage as a function of CPU can be modeled as affine or constant for most jobs (> 70%).
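Steps 3 and 4 above can be sketched as a least-squares fit followed by a threshold test. This is a simplified illustration: it skips the clustering of step 2 and treats all tasks as one cluster, and both tolerances (`fit_tol`, `flat_tol`) are hypothetical values, not the ones used in the study.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit ys ~ slope * xs + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return slope, my - slope * mx


def classify_job(cpu, mem, fit_tol=0.1, flat_tol=0.1):
    """Label a job's tasks as 'Flat', 'Slope' or 'BadlyFitted'.

    fit_tol: maximum RMSE, relative to mean memory, for a fit to count as good.
    flat_tol: maximum memory change along the fitted line, relative to mean
              memory, for the job to count as Flat.
    """
    slope, intercept = fit_line(cpu, mem)
    mean_mem = sum(mem) / len(mem)
    rmse = math.sqrt(
        sum((m - (slope * c + intercept)) ** 2 for c, m in zip(cpu, mem)) / len(mem)
    )
    if rmse > fit_tol * mean_mem:
        return "BadlyFitted"
    variation = abs(slope) * (max(cpu) - min(cpu))
    return "Flat" if variation <= flat_tol * mean_mem else "Slope"


cpu = [1.0, 2.0, 3.0, 4.0]  # hypothetical per-task CPU usage
print(classify_job(cpu, [2.0, 2.0, 2.0, 2.0]))  # memory constant
print(classify_job(cpu, [1.0, 2.0, 3.0, 4.0]))  # memory affine in CPU
print(classify_job(cpu, [1.0, 5.0, 1.0, 5.0]))  # no linear relation
```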



Workload Characterization

Are Dominant Jobs stable over time?

Priority | Percentage of jobs | Duration
Gratis | 50% | < 25 minutes
Gratis | 1% | > 30 hours
Other | 50% | < 25 minutes
Other | 1% | > 15 hours
Normal Production | 50% | > 31.7 hours
Normal Production | 15.6% | whole trace


Workload Characterization

Do Dominant Jobs exhibit any patterns over time?

Correlation of resource usage over time among jobs is important for efficient job allocation:
• Two positively correlated jobs (peaking at the same time) can be allocated on different machines to avoid starvation
• Two negatively correlated jobs (peaking at different times) can be packed together to achieve better resource utilization
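The two cases above can be illustrated with the Pearson correlation coefficient of two usage time series. The series below are synthetic hourly signals built for the example; using Pearson correlation as the co-location criterion is an assumption of this sketch.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(vx * vy)


hours = range(48)
# Synthetic jobs with a 24-hour period.
job_a = [1 + math.cos(2 * math.pi * h / 24) for h in hours]         # peaks at hour 0
job_b = [2 + 0.5 * math.cos(2 * math.pi * h / 24) for h in hours]   # peaks with job_a
job_c = [1 + math.cos(2 * math.pi * (h - 12) / 24) for h in hours]  # peaks 12 hours later

print(pearson(job_a, job_b))  # close to +1: place on different machines
print(pearson(job_a, job_c))  # close to -1: good candidates for co-location
```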

We have performed an analysis of the periodicity of the resource usage of jobs:
• The resource usage can be approximated by a periodic function
• Analysis of the main components of the spectrum
• Analysis restricted to the dominant Normal Production jobs (long enough)
• Harmonics removal
• Quantification of amplitude, phase, frequency and background noise
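A minimal sketch of the spectral step: a direct discrete Fourier transform that returns the period, amplitude and phase of the strongest component of a usage signal. It does not implement harmonics removal or noise quantification, and the test signal is synthetic. The function name and sampling interval are assumptions of this example.

```python
import cmath
import math


def dominant_component(signal, dt_hours):
    """Return (period in hours, amplitude, phase in degrees) of the strongest
    non-constant DFT component of a uniformly sampled signal.

    The phase is that of a cosine, A * cos(2*pi*t/period + phase).
    """
    n = len(signal)
    mean = sum(signal) / n
    best_k, best_coef = None, 0j
    # Frequencies strictly below Nyquist, so that amplitude = 2*|X_k|/n holds.
    for k in range(1, (n + 1) // 2):
        coef = sum(
            (x - mean) * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        if best_k is None or abs(coef) > abs(best_coef):
            best_k, best_coef = k, coef
    amplitude = 2 * abs(best_coef) / n
    phase = math.degrees(cmath.phase(best_coef))
    period = n * dt_hours / best_k
    return period, amplitude, phase


# Synthetic job: one week of hourly samples with a daily cycle.
signal = [3 + 2 * math.cos(2 * math.pi * t / 24 - math.pi / 3) for t in range(168)]
period, amplitude, phase = dominant_component(signal, dt_hours=1.0)
print(period, amplitude, phase)
```

For long traces an FFT would replace the quadratic loop. The direct sum is used here only to keep the example dependency-free.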


Workload Characterization

Do Dominant Jobs exhibit any patterns?

Figure: CPU usage of a Normal Production job with daily and weekly patterns. x-axis: day of the trace (02 to 29); y-axis: CPU (core-sec/sec, 0 to 7).

Patterns:
• >50% of jobs show strong daily patterns
• 67% of jobs show weekly patterns (5 days of high usage, 2 days of lower usage)

Phase difference:
• >50% of jobs: <60 degrees (4 hours) apart
• >90% of jobs: <120 degrees (8 hours) apart
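The hour figures in parentheses follow from reading the phase as a fraction of the 24-hour daily cycle. A small helper makes the conversion explicit; the function name is illustrative.

```python
def phase_to_hours(phase_degrees, period_hours=24.0):
    """Convert a phase difference in degrees into a time offset within one period."""
    return phase_degrees * period_hours / 360.0


print(phase_to_hours(60))   # offset for the ">50% of jobs" bound
print(phase_to_hours(120))  # offset for the ">90% of jobs" bound
```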


Machine Failure Characterization

Can machine failures be modeled after a distribution?

Assumptions:
• Machines fail independently
• Failure probability is constant

Model:
• Poisson distribution P(λ)
• λ = average number of failures = 0.97

Figure: Distribution of machine removal events. Number of time windows (log scale) versus number of events per window (0 to 9), actual versus expected.

Actual versus expected distribution of failures
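The "expected" values in such a comparison can be computed from the Poisson probability mass function. The sketch below uses λ = 0.97 from the slide; the number of time windows is a hypothetical placeholder, since the slide does not state it.

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the mean is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def expected_window_counts(lam, n_windows, max_events=9):
    """Expected number of windows containing 0, 1, ..., max_events events."""
    return [n_windows * poisson_pmf(k, lam) for k in range(max_events + 1)]


LAMBDA = 0.97     # average number of failures, from the slide
N_WINDOWS = 1000  # hypothetical; not given in the slides

for k, count in enumerate(expected_window_counts(LAMBDA, N_WINDOWS)):
    print(k, round(count, 3))
```

Comparing these expected counts with the observed histogram is what the figure does. A large gap in the tail would suggest that failures are not independent.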


Work Outcome

Modeling the Workload

• How many parameters to include (realism vs. overfitting)?
• It depends on the system being modeled


Work Outcome

Allocation algorithms

1. Focus on jobs, not tasks. Examples:
   • Job load balanced among its tasks
   • Job allocation computed globally + greedy allocation for individual tasks
2. Describe jobs with their aggregated CPU and memory
3. Consider correlation among dimensions (CPU, memory)
4. Consider at least daily and weekly patterns
5. Machines can be assumed to have independent failures and a failure rate of 10^-5 per hour
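Under the independence and constant-rate assumptions of point 5, the chance that a job is hit by at least one machine failure follows from the exponential distribution. The sketch below assumes each task runs on a distinct machine, which is an assumption of this example. The task count of 470 is the average reported for dominant jobs, and the 24-hour horizon is arbitrary.

```python
import math


def prob_any_failure(n_machines, rate_per_hour, hours):
    """P(at least one of n independent machines fails within `hours`),
    each failing at a constant rate (exponential time to failure)."""
    return 1.0 - math.exp(-n_machines * rate_per_hour * hours)


FAILURE_RATE = 1e-5  # per machine per hour, from the slide
p = prob_any_failure(n_machines=470, rate_per_hour=FAILURE_RATE, hours=24)
print(p)
```

With these inputs the probability is about 0.107, so even a low per-machine failure rate matters for jobs that span hundreds of machines.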


Future Work

• Propose a complete generative model based on the identified parameters
• Characterization of machine failures over time
• Design and validation of efficient resource allocation algorithms
  • Example: Allocation of services with periodic resource usage by co-locating jobs with compatible peak times


Backup Slides


State transitions for jobs and tasks

Figure: State transitions for jobs and tasks.

Source: Google cluster-usage traces: format + schema. Charles Reiss, John Wilkes, Joseph Hellerstein. Version of 2013.05.06, for trace version 2. Copyright © 2011 Google Inc. All rights reserved.


Trace Timeline

Figure: Mapping of original times to times emitted in the trace.

Source: Google cluster-usage traces: format + schema. Charles Reiss, John Wilkes, Joseph Hellerstein. Version of 2013.05.06, for trace version 2. Copyright © 2011 Google Inc. All rights reserved.


Trace Utilization

Mapping of original times to times emitted in the trace.

Source: Towards understanding heterogeneous clouds at scale: Google trace analysis. Charles Reiss (UC Berkeley), Alexey Tumanov (CMU), Gregory R. Ganger (CMU), Randy H. Katz (UC Berkeley), Michael A. Kozuch (Intel Labs). Intel Science & Technology Center for Cloud Computing, Carnegie Mellon University.


Tasks per Job

Figure: CDF of the number of tasks per job. x-axis: number of tasks (0 to 35000); y-axis: fraction of jobs (0.0 to 1.0).

