Space-efficient Tracking of Persistent Items in a Massive Data Stream
Bibudh Lahiri and Srikanta Tirthapura
Electrical & Computer Engg., Iowa State University
Jaideep Chandrashekar
Technicolor Labs, Palo Alto
ACM DEBS 2011
Temporal Persistence: A Not-so-Discussed Problem in Data Streams
• Motivation from security, formulation as a problem in streams
• Botnets, port scans, click fraud
  • Appear in a temporally regular manner
  • Do the damage, yet evade the radar
  • Not necessarily in large volume (stealthy)
  • Heavy-hitter algorithms do not work
State of the Art in Data Stream Research
• Frequency moments, heavy hitters, entropy, variance
  • Enough to know how many times each i ∈ {1, …, m} occurs in the stream
• Persistence: When does i occur in the stream? In how many slots, in total?
Persistent Behavior in Botnet Traffic
• Giroire et al. [1]
  • Consecutive connections to the same destination are often separated by an hour or more
  • Most bots occur in 100% of the slots in a window when slot length s = 1 hr
  • MyBot-8926 occurs in 100% of the slots when s = 16 hrs!
• Li et al. [2]
  • Periodic botnet events about every ½ hr
[1] "Exploiting Temporal Persistence to Detect Covert Botnet Channels", RAID 2009
[2] "Automating Analysis of Large-scale Botnet Probing Events", ASIACCS 2009
Problem Definition
• Time is split into slots 1, 2, …, n of equal length
• Stream S = <(d_i, t_i)>; d_i: item ID, t_i ∈ {1, 2, …, n}
• Window S_l^r over [l, r] = {(d_i, t_i) ∈ S | l ≤ t_i ≤ r}
• p_d(l, r) = persistence of d in S_l^r = number of distinct slots in [l, r] in which d appears
Example stream:
  Slot 1: a, d, b
  Slot 2: c, d, e
  Slot 3: a, c, d, b
  Slot 4: a, b, a, c
  Slot 5: b, c
  Slot 6: a, b, b
  Slot 7: c, c, d, c
p_a(4,7) = 2, p_b(4,7) = 3, p_c(4,7) = 3, p_d(4,7) = 1
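The persistence values in the example can be checked with a few lines of Python. This is only an illustrative sketch: the names `SLOTS` and `persistence` are not from the slides, and the slot contents are read off the seven-slot example above.

```python
# Example stream from the slide, keyed by slot number (1..7).
SLOTS = {
    1: ["a", "d", "b"],
    2: ["c", "d", "e"],
    3: ["a", "c", "d", "b"],
    4: ["a", "b", "a", "c"],
    5: ["b", "c"],
    6: ["a", "b", "b"],
    7: ["c", "c", "d", "c"],
}


def persistence(slots, item, l, r):
    """Number of distinct slots in [l, r] in which `item` appears."""
    return sum(1 for t in range(l, r + 1) if item in slots.get(t, ()))


if __name__ == "__main__":
    for d in "abcd":
        print(d, persistence(SLOTS, d, 4, 7))
```

Repeated occurrences inside one slot (for example the three c's in slot 7) count only once, which is what separates persistence from frequency.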
Problem Definition
• Item d is α-persistent in S_l^r if it appears in at least α(r-l+1) slots
• With α = 0.5, a, b and c are α-persistent in [4,7]; d is not
• Goal: to detect α-persistent items
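As a baseline, the definition can be applied directly by remembering, for every item, the set of slots in which it occurred. The sketch below is a naive exact tracker under that assumption; `STREAM` and `alpha_persistent` are illustrative names, and the stream is the seven-slot example flattened into (d, t) pairs.

```python
from collections import defaultdict

# The example stream as (item, slot) pairs, in arrival order.
STREAM = [
    ("a", 1), ("d", 1), ("b", 1),
    ("c", 2), ("d", 2), ("e", 2),
    ("a", 3), ("c", 3), ("d", 3), ("b", 3),
    ("a", 4), ("b", 4), ("a", 4), ("c", 4),
    ("b", 5), ("c", 5),
    ("a", 6), ("b", 6), ("b", 6),
    ("c", 7), ("c", 7), ("d", 7), ("c", 7),
]


def alpha_persistent(stream, l, r, alpha):
    """Exact detection: items that appear in at least alpha*(r-l+1) slots of [l, r]."""
    slots_seen = defaultdict(set)  # item -> set of distinct slots inside the window
    for d, t in stream:
        if l <= t <= r:
            slots_seen[d].add(t)
    threshold = alpha * (r - l + 1)
    return sorted(d for d, slots in slots_seen.items() if len(slots) >= threshold)


if __name__ == "__main__":
    print(alpha_persistent(STREAM, 4, 7, 0.5))
```

This keeps state for every distinct item in the window, which is the cost the small-space algorithm is designed to avoid.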
Our Contributions
• Lower bound: exact tracking needs Ω(|D| · log nα) space
• Approximate tracking:
  • Detect items with p_d ≥ αn with high probability
  • Items with p_d < (α-ε)n are not reported as persistent
Our Contributions
• First algorithm for this problem with any provable guarantee
• Small-space algorithm
  • Space complexity O(1/ε) for Zipfian distributions
• Up to 85% less physical memory than the naïve algorithm
• Typical FPR < 1%, FNR < 4%
Approximate Tracking
• Detect items with p_d ≥ αn whp
• Do not report items with p_d < (α-ε)n
• Fixed window: p_d computed over slots [1, n]
Intuition: Fixed-Window Algorithm
• "Sample and count"
• Sample a random element in the stream
• Once sampled, count occurrences of the item exactly
• Persistence: count only one occurrence per slot
• Sampling method
  • Send every (d, t) through a hash-based filter
  • Chance that a given (d, t) passes the filter is << 1 (in fact, 2/(εn))
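One way to realize such a hash-based filter is to map (d, t) to a pseudo-random value in [0, 1) and compare it with the threshold 2/(εn). The sketch below assumes SHA-256 as the hash and a `seed` parameter to obtain different hash functions; neither choice comes from the slides, which only require that h(d, t) is fixed for a given pair.

```python
import hashlib


def h(d, t, seed=0):
    """Deterministically map (d, t) to a pseudo-random value in [0, 1)."""
    key = f"{seed}|{d}|{t}".encode()
    digest = hashlib.sha256(key).digest()
    # Keep 53 bits so the quotient is an exact float strictly below 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def passes_filter(d, t, eps, n, seed=0):
    """True when (d, t) falls under the sampling threshold 2/(eps*n)."""
    return h(d, t, seed) < 2.0 / (eps * n)


if __name__ == "__main__":
    eps, n = 0.1, 100  # threshold 2/(eps*n) = 0.2
    hits = sum(passes_filter("x", t, eps, n) for t in range(20000))
    print(hits / 20000)  # should be close to 0.2
```

Because the value depends only on (d, t) and the seed, a second occurrence of d in the same slot gets exactly the same verdict as the first.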
Intuition: Fixed-Window Algorithm
• Same d, same t: h(d, t) remains the same
  • Re-occurrences in the same slot do not help
• Same d, different t: the h(d, t) values are independent
• (d, t_d, n_d) is initialized when (d, t) first passes the filter
• Persistent item: enough chances to cross the filter
• Transient item: fewer chances
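Putting the pieces together, a fixed-window "sample and count" pass might look like the sketch below. It is a reading of the intuition above rather than the authors' exact pseudocode: the SHA-256 filter, the function names, and the reporting threshold n_d ≥ (α-ε)n are assumptions, the last chosen so that an item with p_d < (α-ε)n can never be reported. The stream is assumed to arrive in non-decreasing slot order.

```python
import hashlib


def h(d, t, seed=0):
    """Deterministic pseudo-random value in [0, 1) for the pair (d, t)."""
    digest = hashlib.sha256(f"{seed}|{d}|{t}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def fixed_window(stream, n, alpha, eps, seed=0):
    """Report items whose sampled slot count n_d reaches (alpha - eps) * n."""
    tau = 2.0 / (eps * n)      # chance that a given (d, t) passes the filter
    sketch = {}                # d -> [t_d, n_d]: most recent slot, slots counted
    for d, t in stream:
        entry = sketch.get(d)
        if entry is None:
            if h(d, t, seed) < tau:   # first time (d, t) passes: start tracking d
                sketch[d] = [t, 1]
        elif t > entry[0]:            # new slot for a tracked item
            entry[0] = t
            entry[1] += 1
        # same slot again: nothing to do, one occurrence per slot is counted
    threshold = (alpha - eps) * n
    return sorted(d for d, (_, n_d) in sketch.items() if n_d >= threshold)


if __name__ == "__main__":
    # A persistent item in every slot, plus transient items in one slot each.
    n = 200
    stream = []
    for t in range(1, n + 1):
        stream.append(("persistent", t))
        stream.append((f"transient-{t}", t))
    print(fixed_window(stream, n, alpha=0.5, eps=0.2))
```

Since n_d never exceeds the true persistence p_d, the count can only underestimate; the price of sampling is that a persistent item may start being counted late or, with small probability, not at all.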
Intuition: Fixed-Window Algorithm
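[Figure: worked example of the fixed-window algorithm on a stream of items a, b, c and f over Slots 1 to 4. The decision boxes are "d ∈ S?", "t_d < t?" and "h(d, t) < 1/2?", each with Yes/No branches. Arrivals are shown as (d, t) pairs such as (a, 1), (b, 1), (c, 2) and (f, 3); sketch entries are shown as triples such as (c, 2, 1), (a, 3, 1), (c, 4, 2) and (a, 4, 2).]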
Performance: Fixed-Window Algorithm
• False negatives: p_d ≥ αn ⇒ Pr(reported transient) ≤ e^-2 ≈ 13.5%
  • Drops to δ with O(log(1/δ)) parallel instances
• p_d < (α-ε)n ⇒ d is never reported as persistent
• Space = O(P · log(1/δ)/(εn)), where P = ∑_{d ∈ D(S)} p_d
  • Reduces to O(1/ε) for a Zipfian distribution
• Processing time per element: O(log(1/δ))
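The numbers on this slide follow from short arithmetic, sketched below. The reasoning assumed here is that a persistent item is missed only if it fails the filter in each of its first εn slots, which happens with probability (1 - 2/(εn))^(εn) ≤ e^-2, and that parallel instances use independent hash functions so their miss probabilities multiply. The function names are illustrative.

```python
import math


def miss_probability(eps, n):
    """Chance of failing the filter in eps*n independent slots."""
    tau = min(1.0, 2.0 / (eps * n))
    return (1.0 - tau) ** (eps * n)


def instances_needed(delta):
    """Smallest k with (e^-2)^k <= delta, i.e. k >= ln(1/delta) / 2."""
    return max(1, math.ceil(math.log(1.0 / delta) / 2.0))


def expected_sketch_entries(total_persistence, eps, n):
    """Each distinct (d, t) pair passes with chance 2/(eps*n), so at most 2P/(eps*n)."""
    return 2.0 * total_persistence / (eps * n)


if __name__ == "__main__":
    print(math.exp(-2))                  # the single-instance bound
    print(miss_probability(0.1, 1000))   # stays below that bound
    print(instances_needed(0.01))        # instances for a 1% miss rate
```

The per-element processing time of O(log(1/δ)) corresponds to updating each of the parallel instances once per arrival.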
Sliding Window Algorithm
• p_d^c: persistence of d in [c-n+1, c]
• Detect items with p_d^c ≥ αn whp
• Do not report items with p_d^c < (α-ε)n
• Intuition
  • Start a new fixed-window data structure S_t in every distinct slot t where d occurs
  • Won't that take too much space?
  • No…
Intuition: Sliding-Window Algorithm
• Observations
  • Only in a few slots will d pass the filter and initialize S_t
  • In [c-n+1, …, j, …, c], if d first passes the filter in slot j, then S_j represents p_d^c most accurately
  • Note: we save the space for S_{c-n+1}, S_{c-n+2}, …, S_{j-1}
  • At c, we can discard any S_r where r ≤ c-n
• Sketch entry is (d, t, n_{d,t}, t_{d,t}): the item, the slot t in which the entry was initialized, how many slots d has appeared in since then, and the most recent such slot
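The observations above suggest a structure along the following lines. This is a hedged sketch of the intuition, not the authors' pseudocode: the class name, the SHA-256 filter, the list-per-item layout and the reporting threshold (α-ε)n are all assumptions. Each entry is stored as [t, n_dt, t_dt], and arrivals are assumed to come in non-decreasing slot order.

```python
import hashlib


def h(d, t, seed=0):
    """Deterministic pseudo-random value in [0, 1) for the pair (d, t)."""
    digest = hashlib.sha256(f"{seed}|{d}|{t}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


class SlidingPersistence:
    def __init__(self, n, alpha, eps, seed=0):
        self.n, self.alpha, self.eps, self.seed = n, alpha, eps, seed
        self.tau = 2.0 / (eps * n)
        self.sketch = {}  # d -> list of [t, n_dt, t_dt], oldest start slot first

    def update(self, d, t):
        # Drop entries S_r with r <= t - n: they started before the window.
        entries = [e for e in self.sketch.get(d, []) if e[0] > t - self.n]
        for e in entries:
            if e[2] < t:          # d seen in a new slot: count it once per entry
                e[1] += 1
                e[2] = t
        started_now = bool(entries) and entries[-1][0] == t
        if not started_now and h(d, t, self.seed) < self.tau:
            entries.append([t, 1, t])   # (d, t) passed the filter: new entry S_t
        if entries:
            self.sketch[d] = entries
        else:
            self.sketch.pop(d, None)

    def query(self, c):
        """Items whose oldest live entry counts at least (alpha - eps) * n slots."""
        threshold = (self.alpha - self.eps) * self.n
        reported = []
        for d in list(self.sketch):
            live = [e for e in self.sketch[d] if e[0] > c - self.n]
            if not live:
                del self.sketch[d]
                continue
            self.sketch[d] = live
            if live[0][1] >= threshold:   # oldest entry is the best estimate
                reported.append(d)
        return sorted(reported)


if __name__ == "__main__":
    tracker = SlidingPersistence(n=4, alpha=0.75, eps=0.25)
    for d, t in [("a", 4), ("b", 4), ("b", 5), ("a", 6), ("b", 6), ("d", 7)]:
        tracker.update(d, t)
    print(tracker.query(7))
```

Only slots in which (d, t) actually passes the filter create an entry, so the number of entries per item stays small even though, conceptually, a fixed-window structure is started in every slot.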
Intuition: Sliding-Window Algorithm
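[Figure: worked example of the sliding-window algorithm on a stream of items a, b, c and f over Slots 1 to 4. The decision boxes are "(d, t) ∈ S?" and "h(d, t) < 1/2?", each with Yes/No branches. Arrivals are shown as (d, t) pairs such as (a, 1), (b, 1), (c, 2) and (f, 3); sketch entries are shown as 4-tuples (d, t, n_{d,t}, t_{d,t}) such as (c, 2, 1, 2), (a, 2, 1, 2), (c, 3, 1, 3), (c, 2, 2, 3), (a, 3, 1, 3), (a, 2, 2, 3), (a, 4, 1, 4), (a, 3, 2, 4) and (c, 3, 2, 4).]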
Evaluation
• Typically skewed distribution
• 885 million packets, 30-sec slots ⇒ 350 slots in ~3 hrs of data
• Query windows: [1,100], [26,125], …, [251,350]
• In the [1,100] window, ~570k distinct IPs, but ~500k of them occur in < 10 slots
• Storing a counter for every distinct item is a waste of space
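The evaluation setup can be reproduced with a few lines of arithmetic. The step of 25 slots between query windows is inferred from the first two windows listed on the slide; the constant names are illustrative.

```python
SLOT_SECONDS = 30
NUM_SLOTS = 350
WINDOW = 100   # slots per query window
STEP = 25      # assumed shift between consecutive windows

# Windows [1,100], [26,125], ..., ending at the last slot of the trace.
windows = [(l, l + WINDOW - 1) for l in range(1, NUM_SLOTS - WINDOW + 2, STEP)]
hours = NUM_SLOTS * SLOT_SECONDS / 3600

if __name__ == "__main__":
    print(len(windows), windows[0], windows[-1])
    print(round(hours, 2))
```

Each query window therefore spans 100 × 30 s = 50 minutes of traffic.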
Evaluation
• FNR is mostly within 5%, even when ε = 0.49 for α = 0.7
• Even the highest FPR is < 3%
• Small-space algorithm saves up to 85% space compared to the naïve one
  • 445 MB instead of 3 GB
Summary
• Persistent items: important in their own right
  • Motivation: botnet detection, port scans
• Exact solution needs storing all distinct items
• Approximate, small-space solutions for fixed and sliding windows
  • Asymptotically the same space for both
• 70-85% saving in memory for typical values of α (0.5, 0.7) and ε (0.4α to 0.6α)