18
UC Berkeley Online System Problem Detection by Mining Console Logs Wei Xu* Ling Huang Armando Fox* David Patterson* Michael Jordan* *UC Berkeley Intel Labs Berkeley

UC Berkeley Online System Problem Detection by Mining Console Logs Wei Xu* Ling Huang † Armando Fox* David Patterson* Michael Jordan* *UC Berkeley † Intel

Embed Size (px)

Citation preview

Page 1: UC Berkeley Online System Problem Detection by Mining Console Logs Wei Xu* Ling Huang † Armando Fox* David Patterson* Michael Jordan* *UC Berkeley † Intel

UC Berkeley

Online System Problem Detection by Mining Console Logs

Wei Xu* Ling Huang†

Armando Fox* David Patterson* Michael Jordan*

*UC Berkeley † Intel Labs Berkeley

Page 2: UC Berkeley Online System Problem Detection by Mining Console Logs Wei Xu* Ling Huang † Armando Fox* David Patterson* Michael Jordan* *UC Berkeley † Intel

2

Why console logs?

• Detecting problems in large scale Internet services often requires detailed instrumentation

• Instrumentation can be costly to insert & maintain• High code churn• Often combine open-source building blocks that are not

all instrumented

• Can we use console logs in lieu of instrumentation?

+ Easy for developer, so nearly all software has them

– Imperfect: not originally intended for instrumentation

Page 3: UC Berkeley Online System Problem Detection by Mining Console Logs Wei Xu* Ling Huang † Armando Fox* David Patterson* Michael Jordan* *UC Berkeley † Intel

Problems we are looking for

• The easy case – rare messages

• Harder but useful - abnormal sequences

NORMAL

receiving blk_1received blk_1

receiving blk_2

ERROR

what is wrong with blk_2 ???

3

Page 4: UC Berkeley Online System Problem Detection by Mining Console Logs Wei Xu* Ling Huang † Armando Fox* David Patterson* Michael Jordan* *UC Berkeley † Intel

Overview and Contributions

* Large-scale system problem detection by mining console logs (SOSP’ 09)

• Accurate online detection with small latency

4

Frequent pattern based filtering

OK

PCA DetectionOK

ERROR

Dominant cases

Non-patternNormal cases

Real anomalies

Parsing*

Free text logs200 nodes

Page 5: UC Berkeley Online System Problem Detection by Mining Console Logs Wei Xu* Ling Huang † Armando Fox* David Patterson* Michael Jordan* *UC Berkeley † Intel

Constructing event traces from console logs

• Parse: message type + variables• Group messages by identifiers (automatically

discovered)– Group ~= event trace

receiving blk_1

receiving blk_1

received blk_1received blk_1

reading blk_1reading blk_1

receiving blk_2

received blk_2

receiving blk_2

receiving blk_1

receiving blk_1

received blk_1received blk_1

reading blk_1reading blk_1

receiving blk_2

received blk_2

receiving blk_2

5

Page 6: UC Berkeley Online System Problem Detection by Mining Console Logs Wei Xu* Ling Huang † Armando Fox* David Patterson* Michael Jordan* *UC Berkeley † Intel

Online detection: When to make detection?

• Cannot wait for the entire trace• Can last arbitrarily long time

• How long do we have to wait?• Long enough to keep correlations• Wrong cut = false positive

• Difficulties• No obvious boundaries• Inaccurate message ordering • Variations in session duration

6

receiving blk_1

received blk_1

reading blk_1

reading blk_1

deleting blk_1

deleted blk_1

receiving blk_1

received blk_1

Time

Page 7: UC Berkeley Online System Problem Detection by Mining Console Logs Wei Xu* Ling Huang † Armando Fox* David Patterson* Michael Jordan* *UC Berkeley † Intel

Frequent patterns help determine session boundaries

• Key Insight: Most messages/traces are normal• Strong patterns

• “Make common paths fast”• Tolerate noise

7

Page 8: UC Berkeley Online System Problem Detection by Mining Console Logs Wei Xu* Ling Huang † Armando Fox* David Patterson* Michael Jordan* *UC Berkeley † Intel

8

Two stage detection overview

Frequent pattern based filtering

OK

PCA DetectionOK

ERROR

Dominant cases

Non-patternNormal cases

Real anomalies

Parsing

Free text logs200 nodes

Page 9: UC Berkeley Online System Problem Detection by Mining Console Logs Wei Xu* Ling Huang † Armando Fox* David Patterson* Michael Jordan* *UC Berkeley † Intel

Stage 1 - Frequent patterns (1): Frequent event sets

9

receiving blk_1

received blk_1reading blk_1

deleting blk_1

deleted blk_1

receiving blk_1

received blk_1

Time

Coarse cut by time

Find frequent item set

Refine time estimation

reading blk_1

error blk_1

Repeat until all patterns found

PCA Detection

Page 10: UC Berkeley Online System Problem Detection by Mining Console Logs Wei Xu* Ling Huang † Armando Fox* David Patterson* Michael Jordan* *UC Berkeley † Intel

10

Stage 1 - Frequent patterns (2) : Modeling session duration time

• Assuming Gaussian?• 99.95th percentile estimation is off by half• 45% more false alarms

• Mixture distribution• Power-law tail + histogram head

DurationDuration

Cou

nt

Pr(

X>

=x)

Page 11: UC Berkeley Online System Problem Detection by Mining Console Logs Wei Xu* Ling Huang † Armando Fox* David Patterson* Michael Jordan* *UC Berkeley † Intel

11

Stage 2 - Handling noise with PCA detection

• More tolerant to noise• Principal Component Analysis (PCA) based

detection

Frequent pattern based filtering

OK

PCA DetectionOK

ERROR

Dominant cases

Non-patternNormal cases

Real anomalies

Parsing

Free text logs200 nodes

Page 12: UC Berkeley Online System Problem Detection by Mining Console Logs Wei Xu* Ling Huang † Armando Fox* David Patterson* Michael Jordan* *UC Berkeley † Intel

Frequent pattern matching filters most of the normal events

Frequent pattern based filtering

OK

PCA DetectionOK

ERROR

Dominant cases

Non-patternNormal cases

Real anomalies

Parsing*

Free text logs200 nodes

100%86%

14% 13.97%

0.03%

Page 13: UC Berkeley Online System Problem Detection by Mining Console Logs Wei Xu* Ling Huang † Armando Fox* David Patterson* Michael Jordan* *UC Berkeley † Intel

13

Evaluation setup

• Hadoop file system (HDFS)• Experiment on Amazon’s EC2 cloud

• 203 nodes x 48 hours• Running standard map-reduce jobs• ~24 million lines of console logs

• 575,000 traces• ~ 680 distinct ones• Manual label from previous work

• Normal/abnormal + why it is abnormal• “Eventually normal” – did not consider time• For evaluation only

Page 14: UC Berkeley Online System Problem Detection by Mining Console Logs Wei Xu* Ling Huang † Armando Fox* David Patterson* Michael Jordan* *UC Berkeley † Intel

14

Frequent patterns in HDFS

Frequent Pattern 99.95th percentile Duration

% of messages

Allocate, begin write 13 sec 20.3%

Done write, update metadata 8 sec 44.6%

Delete - 12.5%

Serving block - 3.8%

Read exception - 3.2%

Verify block - 1.1%

Total 85.6%

• Covers most messages• Short durations

(Total events ~20 million)

Page 15: UC Berkeley Online System Problem Detection by Mining Console Logs Wei Xu* Ling Huang † Armando Fox* David Patterson* Michael Jordan* *UC Berkeley † Intel

15

Detection latency

• Detection latency is dominated by the wait time

Single event pattern

Frequent pattern (matched)

Frequent pattern (timed out)

Non pattern events

Page 16: UC Berkeley Online System Problem Detection by Mining Console Logs Wei Xu* Ling Huang † Armando Fox* David Patterson* Michael Jordan* *UC Berkeley † Intel

16

Detection accuracy

True Positives

False Positives

False Negatives

Precision Recall

Online 16,916 2,748 0 86.0% 100.0%

Offline 16,808 1,746 108 90.6% 99.3%

(Total trace = 575,319)

• Ambiguity on “abnormal”• Manual labels: “eventually normal”• > 600 FPs in online detection as very long latency• E.g. a write session takes >500sec to complete

(99.99th percentile is 20sec)

Page 17: UC Berkeley Online System Problem Detection by Mining Console Logs Wei Xu* Ling Huang † Armando Fox* David Patterson* Michael Jordan* *UC Berkeley † Intel

Future work

• Distributed log stream processing– Handle large scale cluster + partial failures

• Clustering alarms• Allowing feedback from operators• Correlation on logs from multiple applications /

layers

17

Page 18: UC Berkeley Online System Problem Detection by Mining Console Logs Wei Xu* Ling Huang † Armando Fox* David Patterson* Michael Jordan* *UC Berkeley † Intel

Summary

http://www.cs.berkeley.edu/~xuw/Wei Xu <[email protected]>

Precision Recall

Online 86.0% 100.0%

18

Frequent pattern based filtering

OK

PCA DetectionOK

ERROR

Dominant cases

Non-patternNormal cases

Real anomalies

Parsing

Free text logs200 nodes24 million lines