Context-Aware Time Series Anomaly Detection for Complex Systems

Manish Gupta, UIUC
Abhishek B. Sharma, NEC Labs America
Haifeng Chen, NEC Labs America
Guofei (Geoff) Jiang, NEC Labs America

SDM4Service 5/4/2013


Focus on Complex Systems

• Lots of components interacting to accomplish challenging tasks
• Examples: data centers, power plants, manufacturing plants

Switch to proactive maintenance

1. Continuous monitoring from multiple vantage points
2. Replace calendar-based or reactive maintenance with early detection, localization, and remediation

• Enabling technologies
  – Low-cost ubiquitous sensing and communication
• Challenge
  – How to combine heterogeneous data?
  – Unstructured or semi-structured log data
  – Multivariate time series data

Importance of Collating Information

• Only time series data
  – No context or global semantic view
  – Many false positives
  – Multiple alarms related to a single event
• Only system logs
  – High-level application/workflow view
  – Incomplete coverage
    • "Cost"
  – Lack of root-cause visibility
    • Absence of observed system behavior

[Figure: normalized measurement of CPU utilization and memory utilization versus time (s), with the task execution interval marked]

Our vision

• Logs capture the context of a system's operations
• Time series monitoring data record the state of different components
• Hypothesis: jointly mining log and time series data for anomaly detection is more accurate and robust.
  – Context-aware time series anomaly detection

Outline

1. Introduction and Motivation
2. Framework for combining logs and time series data
3. Proposed solution
4. Instantiation details for Hadoop
5. Evaluation
6. Conclusion

Framework for combining logs and time series data

[Figure: framework diagram; the only label that survives is "Time series"]

What is an instance?

• An instance spans the interval between two consecutive context-changing events on a component.
  – Assumption: we can identify context-changing events.
• Instance I = (C, M); C: context features, M: metrics/time series

[Figure: timeline from t1 to t2; t1: task execution starts, t2: task execution finishes]
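A minimal sketch of how instances I = (C, M) could be cut out of one component's data. It assumes that context-changing events are available as sorted timestamps and that metrics arrive as timestamped samples. The names `Instance` and `build_instances`, the data layout, and the toy values are hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Instance:
    start: float                # t1: context-changing event that opens the interval
    end: float                  # t2: the next context-changing event
    context: Dict[str, float]   # C: context features for the interval
    metrics: List[List[float]]  # M: metric samples recorded in [t1, t2)


def build_instances(event_times, contexts, samples):
    """Split one component's metric samples at consecutive context-changing events.

    event_times: sorted timestamps of context-changing events
    contexts:    one context-feature dict per interval (len(event_times) - 1)
    samples:     (timestamp, [metric values]) tuples
    """
    instances = []
    for (t1, t2), ctx in zip(zip(event_times, event_times[1:]), contexts):
        rows = [values for ts, values in samples if t1 <= ts < t2]
        instances.append(Instance(t1, t2, ctx, rows))
    return instances


# Toy data (hypothetical): three events define two instances.
events = [0.0, 10.0, 25.0]
contexts = [{"maps": 4, "reduces": 0}, {"maps": 0, "reduces": 2}]
samples = [(0.0, [10, 20.9]), (5.0, [20, 24.5]), (10.0, [10, 24.6]), (20.0, [30, 20.6])]
instances = build_instances(events, contexts, samples)
```

Using half-open intervals [t1, t2) is a design choice of this sketch: it keeps a sample that falls exactly on an event from being assigned to two instances.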

Problem statement and solution approach

• Given: instances I1, I2, …, IL
• Find: top K anomalous instances
• 2-stage solution
  • Find patterns
    • Context patterns
    • Metric patterns
  • Find anomalies

Two notions of similarity:
• Peer similarity: similarity in context variables across instances
• Temporal similarity: similarity in time series data for similar contexts

Proposed Solution

• Extraction of Context Patterns
  – Normalize the data
  – Use K-means clustering
• Extraction of Metric Patterns
• Anomaly Detection
  – $\mathrm{score}(I) = 1 - \mathrm{sim}(M, M_2)$
• Anomaly Post-processing
  – Remove instance if nearest context cluster is far away.

[Figure: context clusters C1, C2, C3 and metric patterns M1, M2, M3, M4; one instance is labeled "Not an anomaly" and another "Anomaly"]

[Figure: two metric time series matrices, annotated "similarity?", next to $P_{c \times k}$ and $Q_{c \times k}$, also annotated "similarity?"]

First time series matrix:

CPU  Memory  Disk Read  Disk Write  eth0 TX  eth0 RX  CPU
10   20.9    622.4      14.0        4.2      5.6      10
20   24.5    977.7      82.7        4.0      1.8      20
10   24.6    836.4      90.0        1.6      1.6      10
30   20.6    198.6      78.6        7.1      0.3      30
40   29.3    850.9      99.1        5.1      10.0     40

Second time series matrix:

CPU  Memory  Disk Read  Disk Write  eth0 TX  eth0 RX  CPU
10   20.9    622.4      14.0        4.2      5.6      10
20   24.5    977.7      82.7        4.0      1.8      20
10   24.6    836.4      90.0        1.6      1.6      10
30   20.6    198.6      78.6        7.1      0.3      30
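The score 1 − sim(M, M2) needs a similarity between two multivariate time series that may differ in length. The slides do not say which measure is used. The sketch below assumes one common choice, a PCA similarity factor: each series is reduced to its top-k principal directions (playing the role of the c×k matrices P and Q), and the similarity is the mean squared cosine between the two sets of directions. All function names are hypothetical, and the eigenvectors come from a plain power iteration that assumes the leading eigenvalues are distinct and nonzero.

```python
import math


def covariance(series):
    """Sample covariance (c x c) of a series given as rows of c metric values."""
    n, c = len(series), len(series[0])
    means = [sum(row[j] for row in series) / n for j in range(c)]
    centered = [[row[j] - means[j] for j in range(c)] for row in series]
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(c)]
            for a in range(c)]


def principal_directions(series, k, iters=500):
    """Top-k principal directions via power iteration with deflation.

    Assumption of this sketch: the k leading eigenvalues are distinct and nonzero.
    """
    a = covariance(series)
    c = len(a)
    directions = []
    for _ in range(k):
        v = [float(j + 1) for j in range(c)]  # arbitrary start vector
        for _ in range(iters):
            w = [sum(a[i][j] * v[j] for j in range(c)) for i in range(c)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                raise ValueError("fewer than k nonzero principal components")
            v = [x / norm for x in w]
        lam = sum(v[i] * a[i][j] * v[j] for i in range(c) for j in range(c))
        directions.append(v)
        # Deflate: remove the component just found before searching for the next.
        a = [[a[i][j] - lam * v[i] * v[j] for j in range(c)] for i in range(c)]
    return directions


def pca_similarity(m1, m2, k):
    """Mean squared cosine between the two k-dimensional principal subspaces."""
    p, q = principal_directions(m1, k), principal_directions(m2, k)
    total = sum(sum(x * y for x, y in zip(pi, qj)) ** 2 for pi in p for qj in q)
    return total / k


def anomaly_score(metrics, pattern, k):
    """score(I) = 1 - sim(M, pattern), with the assumed PCA similarity."""
    return 1.0 - pca_similarity(metrics, pattern, k)


# Toy series: one varies only in the first metric, the other only in the second.
along_x = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
along_y = [[0.0, 0.0], [0.0, 1.0], [0.0, 2.0], [0.0, 3.0]]
same = anomaly_score(along_x, along_x, k=1)       # close to 0
different = anomaly_score(along_x, along_y, k=1)  # close to 1
```

Because the comparison uses only the c×k direction matrices, the two series can have different numbers of rows, which fits the five-row and four-row example matrices on the slide.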

Instantiating the framework for MapReduce (Hadoop)

• MapReduce programming model
  – Example: count the frequency of all words appearing in a document
• Distributed block storage (e.g. HDFS)
• Two phases of computation: Map and Reduce

Input split    Map (intermediate output)
A B C          A: 1  B: 1  C: 1
B C D          B: 1  C: 1  D: 1
E F G          E: 1  F: 1  G: 1
A B D          A: 1  B: 1  D: 1

Reduce (final output): A: 2, B: 3, C: 2, D: 2, E: 1, F: 1, G: 1
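The word-count example from the slide can be traced with a small single-process sketch. It only mimics the two phases and is not Hadoop code; `map_phase` and `reduce_phase` are illustrative names.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]


def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


# The four input splits shown on the slide.
splits = ["A B C", "B C D", "E F G", "A B D"]
intermediate = [pair for split in splits for pair in map_phase(split)]
final = reduce_phase(intermediate)
print(final)  # {'A': 2, 'B': 3, 'C': 2, 'D': 2, 'E': 1, 'F': 1, 'G': 1}
```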

Hadoop: Open source implementation of MapReduce runtime

Map and Reduce phases exhibit peer and temporal similarity

Discussion

• Selecting the number of principal components
  – Capture >95% variance for both time series.
• Selecting the number of context/metric clusters
  – Knee point of the within-cluster sum of squares versus number of clusters curve.
• Richer context for MapReduce
  – Job conf parameters
  – Events extracted using regex pattern matches from logs.
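Both selection rules can be sketched in a few lines. The first function picks the smallest number of components that captures more than 95% of the variance, given the eigenvalues of a series' covariance matrix. The second locates a knee on the within-cluster sum of squares (WCSS) curve. The slides do not say how the knee is found, so the farthest-point-from-the-chord heuristic used here is an assumption, as are all function names and the toy numbers.

```python
def components_for_variance(eigenvalues, threshold=0.95):
    """Smallest number of leading components whose variance share exceeds threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total > threshold:
            return count
    return len(ordered)


def components_for_both(eigs_a, eigs_b, threshold=0.95):
    """Use the larger count so that both series keep more than threshold variance."""
    return max(components_for_variance(eigs_a, threshold),
               components_for_variance(eigs_b, threshold))


def knee_point(cluster_counts, wcss):
    """Assumed heuristic: the curve point farthest from the line joining its ends."""
    x0, y0, x1, y1 = cluster_counts[0], wcss[0], cluster_counts[-1], wcss[-1]

    def distance(x, y):
        # Unnormalized point-to-line distance; the common denominator is dropped.
        return abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0)

    best = max(range(len(wcss)), key=lambda i: distance(cluster_counts[i], wcss[i]))
    return cluster_counts[best]


# Toy inputs (made up for illustration).
k = components_for_both([6.0, 3.0, 0.8, 0.2], [9.7, 0.2, 0.1])
knee = knee_point([1, 2, 3, 4, 5, 6], [100, 60, 20, 15, 12, 10])
```

Taking the maximum over the two series is one way to read "for both time series": the shared number of components is large enough for whichever series needs more.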

Evaluation

1. Synthetic datasets
  – Context part comes from real Hadoop runs.
  – Metrics part is synthetically generated.
  – Hadoop cluster: master + 5 slaves.
  – Workload: standard Hadoop examples like sorting, counting word frequencies, etc.
  – 3 context clusters.
2. Real Hadoop runs with injected faults
  – CPU hog and Disk hog

Synthetic data: Context Clusters for Hadoop Examples

[Figure: normalized measurement (y-axis, -2 to 3) of each context feature for Cluster1, Cluster2, and Cluster3. Features on the x-axis: #Maps, #Reduces, COMBINE OUTPUT RECORDS, COMMITTED HEAP BYTES, CPU MILLISECONDS, FILE BYTES WRITTEN, HDFS BYTES READ, MAP INPUT BYTES, MAP INPUT RECORDS, MAP OUTPUT BYTES, MAP OUTPUT MATERIALIZED BYTES, PHYSICAL MEMORY BYTES, RECORDS WRITTEN, REDUCE INPUT GROUPS, REDUCE INPUT RECORDS, REDUCE OUTPUT RECORDS, REDUCE SHUFFLE BYTES, SPILLED RECORDS, SPLIT RAW BYTES, VIRTUAL MEMORY BYTES]

• Cluster 1: large number of Map tasks and high values for Map counters.
• Cluster 2: instances with a few Map and a few Reduce tasks.
• Cluster 3: instances with a large number of Reduce tasks and high values for Reduce counters.
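Context clusters like these come from normalizing the context features and running K-means. A minimal standard-library sketch follows, assuming z-score normalization and Lloyd's algorithm seeded with the first k points. The slides specify neither detail, and the two-feature toy data (#Maps, #Reduces) is invented for illustration.

```python
import math


def zscore_columns(rows):
    """Normalize each context feature to zero mean and unit variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]


def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm; the first k points seed the centroids (assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Hypothetical context rows: [#Maps, #Reduces].
context = [
    [16, 0], [2, 2], [0, 16],  # Map-heavy, small, Reduce-heavy (used as seeds)
    [15, 1], [3, 2], [1, 15],
]
centroids, labels = kmeans(zscore_columns(context), k=3)
print(labels)  # [0, 1, 2, 0, 1, 2]
```

Seeding with the first k points keeps the run deterministic, which is convenient for a sketch; a real run would more likely use random restarts or a smarter seeding scheme.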

Injecting Anomalies in Synthetic Dataset

• Fix an anomaly factor
• Randomly select instances into set R
• For each instance in R, choose to add a swap-anomaly or a new-anomaly.
• Swap anomaly: swap the metrics part with that of another randomly chosen instance.
• New anomaly: replace the metrics time series part with a new random matrix.
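The injection procedure can be sketched as follows. Several details are assumptions of this sketch, not statements from the slides: the anomaly factor is read as the fraction of instances placed in R, swap and new anomalies are chosen with equal probability, and new matrices are filled with uniform random values in [0, 1). The name `inject_anomalies` is hypothetical.

```python
import random


def inject_anomalies(metrics, factor, rng):
    """Return a corrupted copy of `metrics` (one time series matrix per instance)
    together with the kind of anomaly applied to each selected instance."""
    corrupted = [[row[:] for row in m] for m in metrics]
    n = len(metrics)
    # Assumption: the anomaly factor is the fraction of instances put into R.
    chosen = rng.sample(range(n), max(1, round(factor * n)))
    kinds = {}
    for i in chosen:
        if rng.random() < 0.5:
            # Swap anomaly: exchange metrics with another randomly chosen instance.
            # Only i is recorded; whether the partner also counts is left open here.
            j = rng.choice([x for x in range(n) if x != i])
            corrupted[i], corrupted[j] = corrupted[j], corrupted[i]
            kinds[i] = "swap"
        else:
            # New anomaly: replace the metrics with a fresh random matrix.
            rows, cols = len(metrics[i]), len(metrics[i][0])
            corrupted[i] = [[rng.random() for _ in range(cols)] for _ in range(rows)]
            kinds[i] = "new"
    return corrupted, kinds


# Toy data: ten instances, each a 2x2 metric matrix filled with its own index.
metrics = [[[float(i), float(i)], [float(i), float(i)]] for i in range(10)]
corrupted, kinds = inject_anomalies(metrics, factor=0.2, rng=random.Random(0))
```

Passing in a seeded `random.Random` makes an experiment repeatable, which matters when each setting is run many times and averaged.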

Synthetic Dataset Results

• 20 experiments per setting.
• Avg. standard deviations are 3.34% for CA, 7.06% for SI and 4.58% for NC.

[Figure: results chart; surviving labels: SI (1%), NC (28%)]

Results on real Hadoop runs with injected faults

[Figure: anomaly score (0 to 0.35) versus instance number for the Disk Hog and CPU Hog runs]

• Original number of anomalies
  – Disk hog: 7.
  – CPU hog: 4.
• Detected anomalies
  – Disk hog: 4 in top 5, all 7 in top 10.
  – CPU hog: 3 in top 5, all 4 in top 10.

[Figure: CPU utilization (0 to 100) versus time (sec) for the anomaly and for Metric Cluster 0, Metric Cluster 1, and Metric Cluster 2]
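Counts such as "4 in top 5" come from ranking instances by anomaly score and checking how many injected faults land among the first K. A minimal sketch of that bookkeeping, with made-up scores that are not the results reported on the slide and a hypothetical function name:

```python
def hits_in_top_k(scores, injected, k):
    """Count how many injected anomalies rank among the k highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return len(set(ranked[:k]) & set(injected))


# Made-up scores for eight instances; instances 1, 4 and 6 carry injected faults.
scores = [0.02, 0.31, 0.05, 0.20, 0.27, 0.04, 0.12, 0.03]
injected = {1, 4, 6}
print(hits_in_top_k(scores, injected, 3))  # 2
print(hits_in_top_k(scores, injected, 5))  # 3
```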

Conclusion and Future work

• Proactive maintenance is more effective when we combine information from heterogeneous sources
  – System logs and time series measurements
• We proposed a clustering-based approach for finding context patterns from log data and metric patterns from time series
  – Use these patterns for anomaly detection
• Future directions
  – How to define context and instances in other settings?
  – Define anomalies based on transition in context and expected change in metrics


Appendix

Running Time

[Figure: execution time for metric patterns discovery (sec, 0 to 30000) versus number of instances (N = 500, 1000, 2000, 5000) for #Metrics=5, #Metrics=10, and #Metrics=20]

• Algorithm is linear in number of instances.
• Time spent in anomaly detection: ~188 ms.

Real Datasets

• Workload: multiple runs of RandomWriter and Sort.
• RandomWriter (16 Maps) writes 1 GB of data in 64 MB chunks and Sort (16 Maps and 16 Reduces) sorts the data.
• Anomalies are inserted on 1 machine for
  – CPU Hog: infinite loop.
  – Disk Hog: sequential write to file on disk.
• Total instances: 134 (Disk Hog) & 121 (CPU Hog).

Context Clusters for RandomWriter+Sort Dataset

[Figure: normalized measurement (y-axis, -3 to 2) of each context feature for Cluster1, Cluster2, and Cluster3. Features on the x-axis: #Maps, #Reduces, COMMITTED HEAP BYTES, CPU MILLISECONDS, FILE BYTES READ, FILE BYTES WRITTEN, HDFS BYTES WRITTEN, MAP OUTPUT MATERIALIZED BYTES, MAP OUTPUT RECORDS, PHYSICAL MEMORY BYTES, RECORDS WRITTEN, REDUCE INPUT RECORDS, REDUCE OUTPUT RECORDS, REDUCE SHUFFLE BYTES, SPILLED RECORDS, SPLIT RAW BYTES, VIRTUAL MEMORY BYTES]

• Cluster 1 consists of a mix of Maps and Reduces and has a distinctly high number of HDFS bytes being written.
• Cluster 2 is Map-heavy and shows a large number of Map Output Records.
• Cluster 3 is Reduce-heavy and hence shows high activity in Reduce counters.


Metric Patterns