Detecting Discontinuities in Large-Scale Systems

Haroon Malik, Software Architecture Group (SWAG), University of Waterloo, Waterloo, Canada

Ian John Davis, Software Architecture Group (SWAG)

Michael Godfrey, Software Architecture Group (SWAG)

Douglas Neuse & Serge Mankovskii, Capacity Planning Group, CA Technologies, USA

Datacenters Require Good Forecasts

Forecasting Steps

1. Determine purpose

2. Select technique

3. Prepare data

4. Prepare forecast

5. Monitor forecast


Challenges (a) Large volumes of performance data, (b) Limited time, (c) Domain knowledge


Discontinuities

[Figure: magnitude plotted against time (days), with the discontinuity and the anomalies marked.]

Discontinuities

Reasons:

1. Company merger

2. Hardware upgrade

3. Software change (new release)

4. Workload change

5. Promotional customers

Symptoms:

[Figure: panels (a) to (d); time points T1, T2, T3 and a transition period are marked.]

Why Care About Discontinuities?

• Measurements taken before the discontinuity can skew the forecast.

• Detecting a discontinuity provides analysts with a reference point to retrain their forecasting models and make necessary adjustments.

We propose an automated approach to help analysts identify discontinuities in performance data

Steps Involved in the Proposed Approach

Input: Performance logs

Approach:

1. Data preparation

2. Metric selection

3. Anomaly detection

4. Discontinuity identification

Output: Report (discontinuities)

1. Data Preparation

The performance logs from production contain noise:

o Missing counters

o Empty counters

o Different numerical ranges

We used statistical techniques to filter noise in the data
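A minimal sketch of this kind of cleanup, assuming the logs are held as a dictionary mapping each counter name to a list of samples, with `None` for a missing sample. The function name `prepare_counters`, the `max_missing` cut-off, mean-filling and z-score scaling are all illustrative assumptions; the slides do not say which statistical techniques were used.

```python
from statistics import mean, pstdev

def prepare_counters(counters, max_missing=0.5):
    """Clean a dict of {counter_name: [value or None, ...]}.

    Hypothetical rules: drop counters that are empty or mostly missing,
    fill the remaining gaps with the counter's mean, then z-score each
    counter so that all counters share a comparable numerical range.
    """
    prepared = {}
    for name, values in counters.items():
        observed = [v for v in values if v is not None]
        if not values or len(observed) < (1 - max_missing) * len(values):
            continue  # empty or mostly missing counter
        mu = mean(observed)
        sigma = pstdev(observed)
        filled = [mu if v is None else v for v in values]
        if sigma == 0:
            prepared[name] = [0.0] * len(filled)  # constant counter
        else:
            prepared[name] = [(v - mu) / sigma for v in filled]
    return prepared

# Illustrative (made-up) log excerpt
logs = {
    "cpu_util": [10.0, 12.0, None, 14.0],       # one missing sample
    "disk_queue": [None, None, None, None],     # empty counter
    "mem_mb": [2048.0, 2050.0, 2052.0, 2054.0], # much larger range
}
clean = prepare_counters(logs)
print(sorted(clean))
```

After this step every surviving counter has zero mean and unit variance, so counters with large raw values do not dominate the later steps.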


2. Metric Selection

Production logs contain thousands of counters that are:

o Highly correlated

o Invariants

o Configuration constants

We used Principal Component Analysis (PCA) to select important metrics
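One way PCA can be used for metric selection is sketched below with NumPy. The selection rule (keep the top-loading counter of each component needed to reach a variance target), the function name `select_metrics` and its parameters are assumptions made for illustration; the slides only state that PCA was used.

```python
import numpy as np

def select_metrics(data, names, variance_target=0.9, top_per_component=1):
    """Pick the counters that load most heavily on the leading components.

    data: 2-D array, rows = observations, columns = counters.
    """
    X = np.asarray(data, dtype=float)
    keep = X.std(axis=0) > 0                     # drop invariants and constants
    X = X[:, keep]
    kept_names = [n for n, k in zip(names, keep) if k]
    X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize each counter
    # PCA via SVD: rows of vt are the principal directions (loadings)
    _, s, vt = np.linalg.svd(X, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    n_comp = int(np.searchsorted(np.cumsum(explained), variance_target) + 1)
    selected = []
    for comp in vt[:n_comp]:
        for idx in np.argsort(-np.abs(comp))[:top_per_component]:
            if kept_names[idx] not in selected:
                selected.append(kept_names[idx])
    return selected

# Synthetic counters (made up for illustration)
rng = np.random.default_rng(0)
cpu = rng.normal(50, 10, 200)
data = np.column_stack([
    cpu,
    2 * cpu + rng.normal(0, 0.1, 200),   # highly correlated with cpu
    rng.normal(4000, 300, 200),          # independent memory counter
    np.full(200, 8.0),                   # configuration constant
])
names = ["cpu", "cpu_x2", "mem", "n_cores"]
selected = select_metrics(data, names)
print(selected)
```

In this synthetic case the constant is dropped before PCA, and only one of the two correlated CPU counters survives alongside the memory counter.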


3. Anomaly Detection

Quadratic Modelling

o A quadratic function that minimizes the least-squares error (LSE)

o A greedy algorithm to replace performance counter time series data

o A cost metric to reflect data fit

The largest costs suggest the positions in the time series where the most egregious anomalies and discontinuities occur
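The idea of a fit cost can be illustrated with a simplified sliding-window sketch. This is not the greedy algorithm from the talk, whose details are not given on the slides. It assumes a fixed window, fits a least-squares quadratic to each window with `numpy.polyfit`, and uses the sum of squared residuals as the cost. The function name `quadratic_fit_costs` and the `window` parameter are hypothetical.

```python
import numpy as np

def quadratic_fit_costs(series, window=10):
    """Least-squares quadratic fit per sliding window; cost = squared error."""
    y = np.asarray(series, dtype=float)
    t = np.arange(window)
    costs = np.zeros(len(y) - window + 1)
    for start in range(len(costs)):
        segment = y[start:start + window]
        coeffs = np.polyfit(t, segment, deg=2)   # minimizes the squared error
        residuals = segment - np.polyval(coeffs, t)
        costs[start] = np.sum(residuals ** 2)
    return costs

# Synthetic counter (made up): smooth trend with a level shift at index 50
series = np.concatenate([np.linspace(1, 2, 50), np.linspace(6, 7, 50)])
costs = quadratic_fit_costs(series)
worst = int(np.argmax(costs))
print("poorest fit in window starting at index", worst)
```

Windows that lie entirely on one side of the shift are fitted almost perfectly, so the cost peaks only for windows that straddle the shift.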


3. Anomaly Detection (Quadratic Model)

[Figure: counter value and the corresponding cost.]

4. Discontinuity Identification

Distribution comparison

o Difference of mean between two populations

o Quantify the difference of mean between two populations


Difference of Mean Between Two Populations

[Figure: % CPU utilization over time, with two anomalies, their transition periods, and the discontinuity marked; the cost is also plotted.]

Wilcoxon Rank-Sum Test: H0 = the two distributions are the same
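A self-contained sketch of the test follows, assuming the two populations are the counter samples before and after a candidate point. It uses the large-sample normal approximation with average ranks for ties and no tie correction, so the p-value is approximate. The function name `rank_sum_test` and the sample values are illustrative.

```python
import math

def rank_sum_test(before, after):
    """Two-sided Wilcoxon rank-sum test using the normal approximation.

    Returns (z, p_value). Ties receive average ranks; no tie correction
    is applied to the variance, so treat p-values as approximate.
    """
    n1, n2 = len(before), len(after)
    pooled = sorted((v, g) for g, sample in enumerate((before, after)) for v in sample)
    rank_sum_before = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        rank_sum_before += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    expected = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (rank_sum_before - expected) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up % CPU utilization samples on either side of a candidate point
before = [20, 22, 21, 23, 19, 20, 22, 21]
after = [35, 37, 36, 38, 34, 36, 35, 37]
z, p = rank_sum_test(before, after)
print("reject H0" if p < 0.05 else "cannot reject H0")
```

A small p-value rejects H0, which suggests the two periods come from different distributions. It does not say how large the difference is.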


Quantify the Difference of Mean Between Two Populations

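Cohen's d: a tunable threshold

$$
\text{effect size} =
\begin{cases}
\text{trivial} & \text{if } \text{Cohen's } d \le 0.2 \\
\text{small} & \text{if } 0.2 < \text{Cohen's } d \le 0.5 \\
\text{medium} & \text{if } 0.5 < \text{Cohen's } d \le 0.8 \\
\text{large} & \text{if } 0.8 < \text{Cohen's } d
\end{cases}
$$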

Analysts set the effect size based on their domain trends and the required granularity.

The effect size acts as a tunable threshold that reduces false-positive identifications of discontinuities by our approach.
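The thresholds above can be applied as in the following sketch. It assumes Cohen's d is computed with the pooled sample standard deviation, which is the usual definition; the slides do not give the exact formula used. The function names and sample values are illustrative.

```python
import math
from statistics import mean, variance

def cohens_d(before, after):
    """Cohen's d using the pooled sample standard deviation.

    Assumes both samples have at least two points and are not both constant.
    """
    n1, n2 = len(before), len(after)
    pooled_var = ((n1 - 1) * variance(before) + (n2 - 1) * variance(after)) / (n1 + n2 - 2)
    return abs(mean(after) - mean(before)) / math.sqrt(pooled_var)

def effect_size(d):
    """Map Cohen's d to the categories used on the slide."""
    if d <= 0.2:
        return "trivial"
    if d <= 0.5:
        return "small"
    if d <= 0.8:
        return "medium"
    return "large"

# Made-up % CPU utilization samples before and after a candidate point
before = [20, 22, 21, 23, 19]
after = [21, 23, 22, 24, 20]
d = cohens_d(before, after)
label = effect_size(d)
print(round(d, 3), label)  # 0.632 medium
```

An analyst who only wants to be alerted to large shifts would report a discontinuity only when the label is "large"; lowering the required level makes the approach more sensitive.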

Subjects of Study

System: Open Source (DVD Store); Domain: E-commerce; Type of Data: Performance Tests

System: Simulation; Domain: Cloud Computing; Type of Data: Synthetic Data

System: Industrial System; Domain: Cloud Computing; Type of Data: Production Data

Fault Injection

Category: Types of Faults

Anomalies: CPU Stress; Memory Stress; Interfering Workload

Discontinuities: Workload as Multiplicative Factor; Change in Transaction Pattern; Hardware & Software Upgrade

We had NO prior knowledge of the underlying fault in the data obtained from the industrial system

Results

Proposed technique has high accuracy in detecting discontinuities

Experts verified the results for the industrial system

[Figure: bar chart of F-measure (axis from 0 to 1) for the Synthetic, Dell DVD Store, and Industrial System (CA) data; values shown, in extraction order: 0.92, 0.72, 0.83.]
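For reference, the F-measure combines precision and recall into a single score. The sketch below assumes the balanced F-measure (F1); the counts in the example are hypothetical and are not taken from the study.

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    detected = true_positives + false_positives
    actual = true_positives + false_negatives
    if true_positives == 0 or detected == 0 or actual == 0:
        return 0.0
    precision = true_positives / detected   # share of reported discontinuities that are real
    recall = true_positives / actual        # share of real discontinuities that were reported
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 8 correct detections, 2 false alarms, 2 misses
score = f_measure(8, 2, 2)
print(round(score, 2))  # 0.8
```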


Limitations of Our Approach

Sensitivity

o We can tune the sensitivity of our approach by adjusting the effect size.

o Using a large effect size reduces false alarms, but it may cause an analyst to overlook significant discontinuities.

o Analysts have to conduct multiple experiments.

Determining a threshold value is a problem. An automated technique generally cannot decide whether an identified discontinuity is important or is noise.

Limitations of Our Approach

Distinguishability

The approach cannot distinguish between:

o Overlapping discontinuities

o Different types of discontinuities

Analysts have to manually inspect the identified discontinuities and take action.

Questions?
