Space-efficient Tracking of Persistent Items in a Massive Data Stream

Bibudh Lahiri and Srikanta Tirthapura, Electrical & Computer Engg., Iowa State University
Jaideep Chandrashekar, Technicolor Labs, Palo Alto
ACM DEBS 2011

Temporal Persistence: A Not-so-Discussed Problem in Data Stream

• Motivation from security, formulation as a problem in streams

• Botnets, port scans, click fraud
  - Appear in a temporally regular manner
  - Do the damage, yet evade the radar
  - Not necessarily in large volume (stealthy)

Heavy-hitter algorithms do not work

State of the Art in Data Stream Research

• Frequency moments, heavy hitters, entropy, variance
  - Enough to know how many times i ∈ {1, …, m} occurs in the stream, for all i

• Persistence: When does i occur in the stream? In how many slots, in total?
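
The contrast between frequency and persistence can be illustrated with a small sketch. The stream below is hypothetical (the item names `burst` and `bot` are made up for illustration): one item floods a single slot, the other appears once in every slot.

```python
from collections import Counter

def frequency(stream):
    """Total number of occurrences of each item, ignoring time."""
    return Counter(d for d, _ in stream)

def persistence(stream):
    """Number of distinct slots in which each item occurs."""
    slots = {}
    for d, t in stream:
        slots.setdefault(d, set()).add(t)
    return {d: len(s) for d, s in slots.items()}

# Hypothetical stream of (item, slot) pairs:
# "burst" occurs 50 times in slot 1, "bot" occurs once in each of slots 1..10.
stream = [("burst", 1)] * 50 + [("bot", t) for t in range(1, 11)]

print(frequency(stream))    # "burst" dominates by volume
print(persistence(stream))  # "bot" dominates by number of slots
```

A frequency-based summary would rank `burst` first, while the slot-based count singles out `bot`, which is the stealthy, temporally regular behavior the talk is after.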

Persistent Behavior in Botnet Traffic

• Giroire et al. [1]
  - Consecutive connections to the same destination are often separated by an hour or more
  - Most bots occur in 100% of the slots in a window when slot length s = 1 hr
  - MyBot-8926 occurs in 100% of the slots when s = 16 hrs!

• Li et al. [2]
  - Periodic botnet events about every ½ hr

[1] "Exploiting Temporal Persistence to Detect Covert Botnet Channels", RAID 2009
[2] "Automating Analysis of Large-scale Botnet Probing Events", ASIACCS 2009

Problem Definition

• Time is split into slots 1, 2, …, n of equal length

• Stream S = ⟨(d_i, t_i)⟩; d_i: item ID, t_i ∈ {1, 2, …, n}

• Window S_l^r over [l, r] = {(d_i, t_i) ∈ S | l ≤ t_i ≤ r}

• p_d(l, r) = persistence of d in S_l^r = number of distinct slots in [l, r] in which d appears

Example stream (items arriving in each slot):

Slot 1: a, d, b
Slot 2: c, d, e
Slot 3: a, c, d, b
Slot 4: a, b, a, c
Slot 5: b, c
Slot 6: a, b, b
Slot 7: c, c, d, c

p_a(4,7) = 2, p_b(4,7) = 3, p_c(4,7) = 3, p_d(4,7) = 1
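
The persistence values for the seven-slot example can be checked with a few lines of Python. This is only an illustrative sketch of the definition; the function name `persistence` and the encoding of the stream as (item, slot) pairs are choices made here, not part of the talk.

```python
def persistence(stream, d, l, r):
    """p_d(l, r): number of distinct slots in [l, r] in which item d appears."""
    return len({t for item, t in stream if item == d and l <= t <= r})

# The seven-slot example, one string of items per slot.
slots = {
    1: "adb", 2: "cde", 3: "acdb", 4: "abac",
    5: "bc", 6: "abb", 7: "ccdc",
}
stream = [(d, t) for t, items in slots.items() for d in items]

for d in "abcd":
    print(d, persistence(stream, d, 4, 7))
# a 2
# b 3
# c 3
# d 1
```

Repeated occurrences within a slot (for example the two a's in slot 4) count only once, because the set keeps distinct slot numbers.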

Problem Definition

• Item d is α-persistent in S_l^r if it appears in at least α(r − l + 1) slots

• With α = 0.5, a, b and c are α-persistent in [4,7]; d is not

• Goal: to detect α-persistent items
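
As a baseline, α-persistent items can be found exactly by keeping the set of slots seen for every distinct item. The sketch below is a naïve reference written for this note (the name `alpha_persistent` is hypothetical); it stores state for every distinct item, which is precisely the cost the small-space algorithm tries to avoid.

```python
from collections import defaultdict

def alpha_persistent(stream, l, r, alpha):
    """Items appearing in at least alpha * (r - l + 1) distinct slots of [l, r]."""
    seen = defaultdict(set)  # item -> distinct slots; one entry per distinct item
    for d, t in stream:
        if l <= t <= r:
            seen[d].add(t)
    threshold = alpha * (r - l + 1)
    return {d for d, s in seen.items() if len(s) >= threshold}

# Same seven-slot example as in the problem definition.
slots = {
    1: "adb", 2: "cde", 3: "acdb", 4: "abac",
    5: "bc", 6: "abb", 7: "ccdc",
}
stream = [(d, t) for t, items in slots.items() for d in items]

print(sorted(alpha_persistent(stream, 4, 7, 0.5)))  # ['a', 'b', 'c']
```

With α = 0.5 and a window of 4 slots the threshold is 2 slots, so a (2 slots), b (3) and c (3) qualify and d (1) does not.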

Our Contributions

• Lower bound: exact tracking needs Ω(|D| · log nα) space

• Approximate tracking:
  - Detect items with p_d ≥ αn with high probability
  - Items with p_d < (α − ε)n are not reported as persistent

Our Contributions

• First algorithm for this problem with any provable guarantee

• Small-space algorithm
  - Space complexity O(1/ε) for Zipfian distributions

• Up to 85% less physical memory than the naïve algorithm

• Typical FPR < 1%, FNR < 4%

Talk Organization

• Introduction
• Fixed-window algorithm
• Sliding-window algorithm
• Evaluation

Approximate Tracking

• Detect items with p_d ≥ αn whp

• Do not report items with p_d < (α − ε)n

• Fixed window: p_d computed over slots [1, n]

Intuition: Fixed-Window Algorithm

• "Sample and count"

• Sample a random element in the stream

• Once sampled, count occurrences of the item exactly

• Persistence: count only one occurrence per slot

• Sampling method
  - Send every (d, t) through a hash-based filter
  - (d, t) passes when its hash h(d, t) falls below a threshold ≪ 1 (in fact, 2/(εn))
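
A hash-based filter of this kind might look like the sketch below. The talk does not say which hash function is used; SHA-256 from Python's standard library, the `seed` parameter, and the names `h` and `passes_filter` are assumptions made here for illustration.

```python
import hashlib

def h(d, t, seed=0):
    """Map (d, t) to a pseudo-random value in [0, 1); same input, same output."""
    digest = hashlib.sha256(f"{seed}|{d}|{t}".encode()).digest()
    # Keep 53 bits so the result is an exact float strictly below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53

def passes_filter(d, t, tau, seed=0):
    """(d, t) passes when its hash value falls below the threshold tau."""
    return h(d, t, seed) < tau

# Illustrative parameters (assumed, not from the talk): eps = 0.2, n = 100 slots.
eps, n = 0.2, 100
tau = 2 / (eps * n)  # threshold 2/(eps*n) = 0.1

print(passes_filter("10.0.0.1", 17, tau))
```

Because the decision depends only on (d, t), repeating the same item in the same slot gives the same answer every time, while a new slot gives the item a fresh, roughly independent chance.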

Intuition: Fixed-Window Algorithm

• Same d, same t: h(d, t) remains the same
  - Re-occurrences in the same slot do not help

• Same d, different t: the h(d, t) values are independent

• (d, t_d, n_d) is initialized when (d, t) first passes the filter

• Persistent item: enough chances to cross the filter

• Transient item: fewer chances
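
Putting the pieces together, the fixed-window idea can be sketched roughly as follows. This is a reading of the slides, not the authors' code: the class name, the SHA-256 hash, the interpretation of t_d as the most recent slot counted, and the reporting rule n_d ≥ (α − ε)n are all assumptions made here. The sketch also assumes slots arrive in non-decreasing order.

```python
import hashlib

def h(d, t, seed=0):
    """Deterministic pseudo-random value in [0, 1) for the pair (d, t)."""
    digest = hashlib.sha256(f"{seed}|{d}|{t}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53

class FixedWindowTracker:
    """Sample-and-count over slots 1..n (illustrative sketch)."""

    def __init__(self, n, alpha, eps, seed=0):
        self.n, self.alpha, self.eps, self.seed = n, alpha, eps, seed
        self.tau = 2 / (eps * n)   # filter threshold from the slides
        self.sketch = {}           # d -> [t_d, n_d]

    def observe(self, d, t):
        entry = self.sketch.get(d)
        if entry is not None:
            if entry[0] < t:       # first occurrence of d in a new slot
                entry[0] = t
                entry[1] += 1
        elif h(d, t, self.seed) < self.tau:
            self.sketch[d] = [t, 1]  # start tracking d from slot t

    def report(self):
        # Assumed rule: report d once its counted slots reach (alpha - eps) * n.
        cutoff = (self.alpha - self.eps) * self.n
        return {d for d, (_, n_d) in self.sketch.items() if n_d >= cutoff}

# Tiny usage example with made-up parameters.
fw = FixedWindowTracker(n=4, alpha=1.0, eps=0.5)  # tau = 1, so every pair passes
for t in range(1, 5):
    fw.observe("x", t)
fw.observe("y", 4)
print(fw.report())  # {'x'}
```

Since n_d only counts slots in which d really occurred, it never exceeds the true persistence, so under this reporting rule an item with p_d < (α − ε)n cannot be reported.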

Intuition: Fixed-Window Algorithm

[Figure: worked example of the fixed-window algorithm on a stream spanning Slots 1 to 4, with filter threshold 1/2. Each arriving (d, t) is tested in turn: d ∈ S? If yes, t_d < t? If no, h(d,t) < 1/2? Sketch tuples shown include (c,2,1), (a,3,1), (c,4,2) and (a,4,2).]

Performance: Fixed-Window Algorithm

• False negatives: p_d ≥ αn ⇒ Pr(reported transient) ≤ e^(−2) ≈ 13.5%

• Drops to δ with O(log(1/δ)) parallel instances

• p_d < (α − ε)n ⇒ d is never reported as persistent

• Space = O(P · log(1/δ) / (εn)), where P = Σ_{d ∈ D(S)} p_d
  - Reduces to O(1/ε) for a Zipfian distribution

• Processing time per element: O(log(1/δ))
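
The false-negative numbers can be reproduced with a short calculation. The sketch assumes that a persistent item is missed only if it fails the filter in roughly εn independent slots, and that parallel instances use independent hash functions with their reports combined by union; both are readings of the slides rather than statements from them. The function names are hypothetical.

```python
import math

def miss_bound(eps, n):
    """Chance that eps*n independent slots all fail a filter with pass rate 2/(eps*n).

    Only meaningful when eps * n > 2.
    """
    k = eps * n
    return (1 - 2 / k) ** k

def instances_needed(delta):
    """Smallest k with (e^-2)^k <= delta, i.e. k >= ln(1/delta) / 2."""
    return math.ceil(math.log(1 / delta) / 2)

print(math.exp(-2))            # about 0.1353, the bound quoted for one instance
print(miss_bound(0.1, 1000))   # stays below e^-2
print(instances_needed(0.01))  # 3
```

Since (1 − x)^k ≤ e^(−xk), a single instance misses with probability at most e^(−2), and k independent instances all miss with probability at most e^(−2k), which is where the O(log(1/δ)) factor comes from.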

Sliding Window Algorithm

• p_d^c: persistence of d in [c − n + 1, c]

• Detect items with p_d^c ≥ αn whp

• Do not report items with p_d^c < (α − ε)n

• Intuition
  - Start a new fixed-window data structure S_t in every distinct slot t where d occurs
  - Won't that take too much space? No…

Intuition: Sliding-Window Algorithm

• Observations
  - Only in a few slots will d pass the filter and initialize S_t
  - In [c − n + 1, …, j, …, c], if d first passes the filter in slot j, then S_j represents p_d^c most accurately
  - Note: we save the space for S_{c−n+1}, S_{c−n+2}, …, S_{j−1}
  - At c, we can discard any S_r where r ≤ c − n

• Sketch is (d, t, n_{d,t}, t_{d,t}): item, slot when initialized, number of slots counted, most recent slot
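
One way to read the observations above is the following sketch, which keeps, per item, one small entry for each slot in which the item passed the filter. It is an interpretation written for this note, not the authors' implementation: the class name, the SHA-256 hash, the lazy expiry on arrival, and the reporting rule n_{d,t} ≥ (α − ε)n applied to the oldest live entry are all assumptions. Slots are assumed to arrive in non-decreasing order.

```python
import hashlib

def h(d, t, seed=0):
    """Deterministic pseudo-random value in [0, 1) for the pair (d, t)."""
    digest = hashlib.sha256(f"{seed}|{d}|{t}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53

class SlidingWindowTracker:
    """Per-item entries t0 -> [n_dt, last_slot], one per slot that passed the filter."""

    def __init__(self, n, alpha, eps, seed=0):
        self.n, self.alpha, self.eps, self.seed = n, alpha, eps, seed
        self.tau = 2 / (eps * n)
        self.entries = {}  # d -> {t0: [slots counted since t0, most recent slot]}

    def observe(self, d, t):
        per_item = self.entries.setdefault(d, {})
        # Discard entries initialized at or before slot t - n.
        for t0 in [s for s in per_item if s <= t - self.n]:
            del per_item[t0]
        # Every surviving entry counts slot t once.
        for entry in per_item.values():
            if entry[1] < t:
                entry[0] += 1
                entry[1] = t
        # Start a new entry if (d, t) passes the filter for the first time.
        if t not in per_item and h(d, t, self.seed) < self.tau:
            per_item[t] = [1, t]
        if not per_item:
            del self.entries[d]

    def report(self, c):
        """Items whose oldest entry inside [c - n + 1, c] reaches the cutoff."""
        cutoff = (self.alpha - self.eps) * self.n
        out = set()
        for d, per_item in self.entries.items():
            live = [t0 for t0 in per_item if t0 > c - self.n]
            if live and per_item[min(live)][0] >= cutoff:
                out.add(d)
        return out

# Seven-slot example from the problem definition, window length n = 4.
slots = {1: "adb", 2: "cde", 3: "acdb", 4: "abac", 5: "bc", 6: "abb", 7: "ccdc"}
sw = SlidingWindowTracker(n=4, alpha=1.0, eps=0.5)  # tau = 1: nothing is filtered out
for t, items in slots.items():
    for d in items:
        sw.observe(d, t)
print(sorted(sw.report(7)))  # ['a', 'b', 'c']
```

With τ = 1 the tracker degenerates to exact counting, which makes the example easy to check by hand against window [4,7]; with a realistic τ ≪ 1 only a few entries per item are ever created. Entries for items that stop arriving are only filtered out at query time here; a fuller version would sweep them periodically.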

Intuition: Sliding-Window Algorithm

[Figure: worked example of the sliding-window algorithm on a stream spanning Slots 1 to 4, with filter threshold 1/2. Tests applied to each arriving (d, t): (d,t) ∈ S? and h(d,t) < 1/2? Sketch tuples shown include (c,2,1,2), (a,2,1,2), (a,3,1,3), (a,2,2,3), (c,3,1,3), (c,2,2,3), (a,4,1,4), (a,3,2,4) and (c,3,2,4).]

Evaluation

• Typically skewed distribution

• 885 million packets, 30-sec slots ⇒ 350 slots in ~3 hrs of data

• Query windows: [1,100], [26,125], …, [251,350]

• In the [1,100] window, ~570k distinct IPs, but ~500k of them occur in < 10 slots

• Storing a counter for every distinct item is a waste of space
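
The slot and query-window arithmetic above can be reproduced as follows. The step of 25 slots between windows is inferred from the first two windows, [1,100] and [26,125], and is an assumption; the constant names are made up for this sketch.

```python
SLOT_SECONDS = 30   # slot length from the slides
NUM_SLOTS = 350     # total slots in the trace
WINDOW = 100        # query window length in slots
STEP = 25           # assumed shift between consecutive windows

# Windows [1,100], [26,125], ..., [251,350] as (l, r) pairs.
windows = [(l, l + WINDOW - 1) for l in range(1, NUM_SLOTS - WINDOW + 2, STEP)]
hours = NUM_SLOTS * SLOT_SECONDS / 3600

print(len(windows), windows[0], windows[-1])
print(round(hours, 2))  # a little under 3 hours of data
```

Under the assumed step this gives 11 overlapping query windows over the trace.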

Evaluation

• FNR is mostly within 5%, even when ε = 0.49 for α = 0.7

• Even the highest FPR is < 3%

• Small-space algorithm saves up to 85% space compared to naïve: 445 MB instead of 3 GB

Summary

• Persistent items: important in their own right
  - Motivation: botnet detection, port scans

• Exact solution needs to store all distinct items

• Approximate, small-space solutions for fixed and sliding windows
  - Asymptotically the same space for both

• 70-85% saving in memory for typical values of α (0.5, 0.7) and ε (0.4α to 0.6α)

Thank You !
