1
Distributed Streams Algorithms for Sliding Windows
Phillip B. Gibbons,
Srikanta Tirthapura
2
Abstract
• Algorithm for estimating aggregate functions over a “sliding window” of the N most recent data items in one or more streams.
3
Single stream
• The first E-approximation scheme for number of 1’s in a sliding window.
• The first E-approximation scheme for the sum of integers in [0..R] in a sliding window.
• Both algorithms are optimal in worst case time and space.
• Both algorithms are deterministic
4
Distributed Streams
• The first randomized E-approximation scheme for the number of 1’s in a sliding window over the union of distributed streams.
5
Usage
• Network Monitoring
• Data Warehousing
• Telecommunications
• Sensor Networks
6
• Multiple Data Source - Distributed Stream Model
• Only the most recent data is important - “Sliding Window”
7
The Goal in the algorithms
• Approximating a function F while minimizing:
• 1. The total memory
• 2. The time taken by each party to process a data item
• 3. The time to produce an estimate (the query time)
8
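Definition 1
• An (E,δ)-approximation scheme for a quantity X: a randomized procedure that, given any positive E<1 and δ<1, computes an estimate of X that is within a relative error of E with probability at least 1-δ.
• An E-approximation scheme: a deterministic procedure that, given any positive E<1, computes an estimate whose worst-case relative error is at most E.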
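Written as inequalities, the two guarantees of Definition 1 look as follows. The symbol \hat{X} for the estimate is introduced here only for illustration, and \epsilon stands for the E used on these slides.

```latex
\Pr\bigl[\,|\hat{X} - X| \le \epsilon X\,\bigr] \ge 1 - \delta
\qquad \text{(randomized } (\epsilon,\delta)\text{-approximation scheme)}
\\[4pt]
|\hat{X} - X| \le \epsilon X
\qquad \text{(deterministic } \epsilon\text{-approximation scheme)}
```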
9
An Example for Basic Counting Problem
10
Algorithms for Distributed Stream
• Each party observes only its own stream
• Each party communicates with other parties only when an estimate is requested
• Each party sends a message to a Referee who computes the estimate
11
The Idea
• Storing a wave consisting of many random samples of the stream.
• Samples that contain only the recent items are sampled at a high probability, while those containing old items are sampled at a lower probability
12
Contributions
• Introducing a data structure called waves
• Presenting the first E-approximation scheme for Basic Counting.
• Presenting the first E-approximation scheme for the sum of integers in [0..R]. Both are optimal in worst-case space, processing time and query time.
13
Contributions
• Presenting the first randomized (E,δ)-approximation scheme for the number of 1’s in a sliding window over the union of distributed streams
14
Related Work
• From the paper of Datar et al.:
• Uses the Exponential Histogram data structure
15
Exponential Histogram
• Maintain more information about recently seen items, less about old items.
• The k0 most recent 1’s are assigned to individual buckets (buckets of size 1).
• The k1 next most recent 1’s are assigned to buckets of size 2.
• The k2 next most recent 1’s are assigned to buckets of size 4.
• And so on, until all the 1’s in the last N items are assigned to some bucket.
16
Exponential Histogram
• Each ki takes one of two allowed values that depend on E.
• The last bucket is discarded if its position no longer falls within the window.
• If the new item is a 1, it is assigned to a new bucket of size 1.
• If this makes k0 too large, then the two least recent buckets of size 1 are merged to form a bucket of size 2.
• If k1 is now too large, the two least recent buckets of size 2 are merged.
• And so on, resulting in a cascade of up to log N bucket merges in the worst case.
• The approach using waves avoids this cascading.
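The cascading merge can be sketched in a few lines. This is a toy illustration, not Datar et al.'s exact structure: the per-size limit `max_per_size` is a hypothetical parameter standing in for the bound on each ki, and the buckets live in a plain Python list, so every update scans the list rather than running in logarithmic time.

```python
class ExponentialHistogram:
    """Toy exponential histogram over a 0/1 stream (illustrative only)."""

    def __init__(self, window, max_per_size):
        self.window = window              # N: number of recent items covered
        self.max_per_size = max_per_size  # hypothetical cap on buckets per size
        self.pos = 0                      # number of items seen so far
        self.buckets = []                 # newest first: [position of newest 1, size]

    def add(self, bit):
        """Process one bit and return how many merges it triggered."""
        self.pos += 1
        # Discard the last (oldest) bucket once its position leaves the window.
        if self.buckets and self.buckets[-1][0] <= self.pos - self.window:
            self.buckets.pop()
        if not bit:
            return 0
        self.buckets.insert(0, [self.pos, 1])
        merges, size = 0, 1
        while True:
            same = [i for i, b in enumerate(self.buckets) if b[1] == size]
            if len(same) <= self.max_per_size:
                return merges
            # Merge the two least recent buckets of this size; the merged
            # bucket keeps the newer of the two positions.
            self.buckets[same[-2]][1] = 2 * size
            del self.buckets[same[-1]]
            merges += 1
            size *= 2


eh = ExponentialHistogram(window=1000, max_per_size=2)
merge_counts = [eh.add(1) for _ in range(7)]
sizes = [size for _, size in eh.buckets]
```

With a cap of two buckets per size, the seventh 1 triggers two merges in a row (size 1 into size 2, then size 2 into size 4), which is the cascade the wave is designed to avoid.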
17
The Basic Wave
• Assumption: 1/E is an integer.
• Counters:
1. pos - the current length of the stream
2. rank - the current number of 1’s in the stream.
• The wave contains the positions of the recent 1’s in the stream, arranged at different “levels”. There are l=log(2EN) levels.
• For i=0,1,..,l-1, level i contains the positions of the 1/E+1 most recent 1-bits whose 1-rank is a multiple of 2^i.
18
An Example for Basic Wave
• The crest of the wave is always over the largest 1-rank
• N=48, 1/E=3, l=5
19
Estimation Steps:
• Let s=max(0,pos-n+1) {estimating the number of 1’s in [s,pos]}
• Let p1 be the maximum position in the wave that is less than s, and p2 the minimum position in the wave that is greater than or equal to s.
• Let r1 and r2 be the 1-ranks of p1 and p2 respectively.
• Return rank-r+1, where r=r2 if r2-r1=1, and otherwise r=(r1+r2)/2.
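The basic wave and its estimation steps can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' code: each level is taken to hold the 1/E+1 most recent qualifying positions, the number of levels is taken as ceil(log2(2EN)), a sentinel entry (position 0, 1-rank 0) stands for "before the stream began", and the class and method names are made up for illustration.

```python
from collections import deque
from math import ceil, log2


class BasicWave:
    """Basic wave for counting 1's among the last n <= N stream items."""

    def __init__(self, max_window, inv_eps):
        self.inv_eps = inv_eps  # 1/E, assumed to be an integer
        num_levels = max(1, ceil(log2(2 * max_window / inv_eps)))
        self.pos = 0    # current length of the stream
        self.rank = 0   # current number of 1's in the stream
        # Level i keeps the 1/E + 1 most recent (position, 1-rank) pairs whose
        # 1-rank is a multiple of 2**i. The pair (0, 0) is a sentinel, an
        # assumption of this sketch.
        self.levels = [deque([(0, 0)], maxlen=inv_eps + 1)
                       for _ in range(num_levels)]

    def add(self, bit):
        self.pos += 1
        if bit:
            self.rank += 1
            for i, level in enumerate(self.levels):
                if self.rank % (1 << i) == 0:
                    level.append((self.pos, self.rank))  # oldest entry drops out

    def estimate(self, n):
        """Estimate the number of 1's among the last n items."""
        if self.pos < n:
            return self.rank            # window covers the whole stream: exact
        s = self.pos - n + 1            # the window is [s, pos]
        entries = {e for level in self.levels for e in level}
        r1 = max(r for p, r in entries if p < s)        # 1-rank of p1
        in_window = [r for p, r in entries if p >= s]
        if not in_window:
            return 0                    # no 1 has arrived inside the window
        r2 = min(in_window)             # 1-rank of p2
        r = r2 if r2 - r1 == 1 else (r1 + r2) / 2
        return self.rank - r + 1
```

A wave built as `BasicWave(max_window=48, inv_eps=3)` matches the N=48, 1/E=3, l=5 example; calling `estimate(n)` after each `add` answers any window size n up to 48. Storing every position at all of its levels is wasteful and is kept here only to mirror the basic definition.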
20
LEMMA 1
• The procedure returns an estimate that is within a relative error of E of the actual number of 1’s in the window.
21
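Proof
• Let j be the smallest numbered level containing position p1.
• By returning the midpoint of the range [r1,r2], we guarantee that the absolute error is at most (r2-r1)/2.
• There is at most a 2^j gap between r1 and the next larger 1-rank r2.
• Thus the absolute error in our estimate is at most 2^(j-1).
• Let r3 be the earliest 1-rank at level j-1. Then r3>r1 and r3>=r2.
• By definition, level j-1 holds 1/E+1 1-ranks spaced 2^(j-1) apart, so the window contains at least rank-r3+1 > (1/E)*2^(j-1) 1’s. The relative error is therefore less than 2^(j-1) / ((1/E)*2^(j-1)) = E.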
22
Improvement
• Use modulo N’ counters for pos and rank, and store the positions in the wave as modulo N’ numbers, so that each takes only log N’ bits.
• Keep track of both the largest 1-rank discarded (r1) and the smallest 1-rank still in the wave (r2), so that a query for the number of 1’s is answered in O(1) time.
• Instead of storing a single position in multiple levels, store each position only at its maximal level.
23
Improvement
24
Improvement
• The positions at each level are stored in a fixed-length queue, so that each time a new position is added, the position at the end of the queue is removed.
• Maintain a doubly linked list of the positions in the wave in increasing order.
• Storing the differences between consecutive positions, instead of the absolute positions, reduces the space.
25
The deterministic wave algorithm
• Upon receiving a stream bit b:
1. Increment pos (modulo N’=2N).
2. If the head (p,r) of the linked list L has expired (p<=pos-N), then discard it from L and from its queue, and store r as the largest 1-rank discarded.
3. If b=1 then do:
(a) Increment rank, and determine the corresponding wave level j, the largest j such that rank is a multiple of 2^j.
(b) If the level j queue is full, discard the tail of the queue and splice it out of L.
(c) Add (pos,rank) to the head of the level j queue and the tail of L.
26
Answering a query for a sliding window of size N:
• 1. Let r1 be the largest 1-rank discarded. (If there is no such r1, return rank as the exact answer.) Let r2 be the 1-rank at the head of the linked list L. (If L is empty, return 0.)
• 2. Return rank-r+1, where r=r2 if r2-r1=1, and otherwise r=(r1+r2)/2.
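The update and query steps can be sketched as follows. Several choices are assumptions of this sketch and are not stated on the slides: the level is capped at l-1, the top-level queue holds 1/E+1 entries and every lower queue holds ceil((1/E+1)/2), a sentinel (position 0, 1-rank 0) starts in the top queue, and an `OrderedDict` keyed by 1-rank plays the role of the doubly linked list L. Python integers are unbounded, so the modulo N’ counters are omitted.

```python
from collections import OrderedDict, deque
from math import ceil, log2


class DeterministicWave:
    """O(1)-update wave for counting 1's in the last N items (sketch)."""

    def __init__(self, window, inv_eps):
        self.N = window
        self.top = max(1, ceil(log2(2 * window / inv_eps))) - 1  # highest level
        # Assumed queue capacities: about half-size below the top level.
        self.cap = [ceil((inv_eps + 1) / 2)] * self.top + [inv_eps + 1]
        self.queues = [deque() for _ in self.cap]   # each holds 1-ranks, oldest first
        self.L = OrderedDict()   # 1-rank -> (position, level), in position order
        self.pos = 0
        self.rank = 0
        self.r1 = None           # largest 1-rank discarded because it expired
        self._store(0, 0, self.top)                 # sentinel (assumption)

    def _store(self, pos, rank, level):
        queue = self.queues[level]
        if len(queue) == self.cap[level]:           # step 3(b): queue is full
            del self.L[queue.popleft()]             # splice the queue tail out of L
        queue.append(rank)                          # step 3(c)
        self.L[rank] = (pos, level)

    def add(self, bit):
        self.pos += 1                               # step 1
        if self.L:                                  # step 2: expire the head of L
            rank, (p, level) = next(iter(self.L.items()))
            if p <= self.pos - self.N:
                del self.L[rank]
                self.queues[level].popleft()        # the head is oldest in its queue
                self.r1 = rank
        if bit:                                     # step 3
            self.rank += 1
            trailing_zeros = (self.rank & -self.rank).bit_length() - 1
            self._store(self.pos, self.rank, min(trailing_zeros, self.top))

    def estimate(self):
        """Estimate the number of 1's among the last N items."""
        if self.r1 is None:
            return self.rank                        # nothing expired yet: exact
        if not self.L:
            return 0
        r2 = next(iter(self.L))                     # 1-rank at the head of L
        r = r2 if r2 - self.r1 == 1 else (self.r1 + r2) / 2
        return self.rank - r + 1
```

Each `add` touches at most one expired entry and one queue, and `estimate` reads two stored values, which is where the O(1) processing and query times come from.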
27
• Space -
• Process time for each item - O(1)
• Estimate time - O(1)
• In related work (Datar et al)
• Space -
• Process time for each item - O(log(EN))
28
Sum of Bounded Integers
• The sum over a sliding window can range from 0 to NR.
• Let N’ be the smallest power of 2 greater than or equal to 2RN.
• Counters (modulo N’):
pos - the current length
total - the running sum
• l=log(2ENR) levels.
• Store a triple (p,v,z) for each item:
v - the value of the data item
z - the partial sum through this item
29
• The answer for query is the midpoint of the interval [total-z2+v2,total-z1)
30
The Algorithm for the sum of the last N items in a data stream
• Upon receiving a stream value v between 0 and R:
• 1. Increment pos (modulo N’).
• 2. If the head (p,v’,z) of the linked list L has expired (p<=pos-N), then discard it from L and from its queue, and store z as the largest partial sum discarded.
• 3. If v>0 then do:
• (a) Determine the largest j such that some number in (total,total+v] is a multiple of 2^j. Add v to total.
• (b) If the level j queue is full, discard the tail of the queue and splice it out of L.
• (c) Add (pos,v,total) to the head of the level j queue and the tail of L.
31
Step 3a
• The desired wave level is the largest position j such that some number y in the interval (total,total+v] has 0’s in all positions less than j.
• y-1 and y differ in bit position j.
• If bit j changes from 1 to 0 at any point in [total,total+v], then j is not the largest.
• j is the position of the most-significant bit that is 0 in total and 1 in total+v.
• j is the most-significant bit that is 1 in the bitwise xor of total and total+v.
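The XOR rule for step 3a is a one-liner, shown here next to a direct search over the interval so the two can be compared. The function names are made up for illustration, and the cap on j at the top wave level is left out.

```python
def wave_level(total, v):
    """Largest j such that some y in (total, total + v] is a multiple of 2**j."""
    if v <= 0:
        raise ValueError("step 3a only applies when v > 0")
    # Position of the most-significant 1 bit in total XOR (total + v).
    return (total ^ (total + v)).bit_length() - 1


def wave_level_bruteforce(total, v):
    """Direct search over the interval, used only to cross-check the XOR rule."""
    best = 0
    for y in range(total + 1, total + v + 1):
        trailing_zeros = (y & -y).bit_length() - 1
        best = max(best, trailing_zeros)
    return best
```

For example, with total=5 (binary 101) and v=2 the interval is {6, 7}; 6 is a multiple of 2, and 101 xor 111 = 010 has its most-significant 1 at bit 1, so both functions give level 1.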
32
Answering a query for a sliding window of size N :
• 1. Let z1 be the largest partial sum discarded from L. (If no such z1, return total as exact answer.) Let (pos,v2,z2) be the head of the linked list L. (If L is empty, return 0).
• 2. Return [total - (z1+z2-v2)/2]
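The value returned in step 2 is the midpoint of the interval between total-z2+v2 and total-z1, which a one-line check confirms:

```latex
\frac{(\mathit{total} - z_2 + v_2) + (\mathit{total} - z_1)}{2}
  \;=\; \mathit{total} - \frac{z_1 + z_2 - v_2}{2}
```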
33
• Space - O(1/E(logN+logR)) memory words of O(logN+logR) bits each
• Process time for each item - O(1)
• Estimate time - O(1)
• In related work (Datar et al.)
• Space - O(1/E(logN+logR)) buckets of logN+log(logN+logR) bits each
• Process time for each item - O(logN+logR)
34
Distributed Streams
• Three definitions for a sliding window over a collection of t>1 distributed streams:
1. Seeking the total number of 1’s in the last N items in each of the t streams (tN items in total).
2. A single logical stream has been split arbitrarily among the parties. Each party receives items that include a sequence number in the logical stream. Seeking the total number of 1’s in the last N items in the logical stream.
3. Seeking the total number of 1’s in the last N items in the position-wise union of the t streams.
35
Solution for First Scenario :
• Applying single stream algorithm to each stream.
• To answer a query, each party sends its count to the Referee.
• The Referee sums the answers.
• Because each individual count is within E relative error, so is the total.
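The last bullet follows from adding the per-stream bounds. Here x_i is the true count at party i and \hat{x}_i its estimate (symbols introduced only for illustration), with \epsilon standing for E:

```latex
\Bigl|\sum_{i=1}^{t}\hat{x}_i - \sum_{i=1}^{t}x_i\Bigr|
  \;\le\; \sum_{i=1}^{t}\bigl|\hat{x}_i - x_i\bigr|
  \;\le\; \sum_{i=1}^{t}\epsilon\,x_i
  \;=\; \epsilon\sum_{i=1}^{t}x_i
```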
36
Solution for Second Scenario :
• To answer a query, each party sends its wave to the Referee.
• The Referee computes the maximum sequence number over all the parties, uses each wave to obtain an estimate over the resulting window, and sums the results.
• Because each individual count is within E relative error, so is the total.
37
Randomized Waves
• Contains the positions of the recent 1’s in the data stream, stored at different levels.
• Each level i contains the most recently selected positions of the 1-bits, where a position is selected into level i with probability 2^-i.
• The deterministic wave selects 1 out of every 2^i 1-bits, at regular intervals.
• A randomized wave selects an expected 1 out of every 2^i 1-bits, at random intervals.
• The randomized wave retains more positions per level.
38
The Basic Randomized Wave
• Let N’ be the smallest power of 2 that is at least 2N
• Let d=logN’
• Let E<1 be the desired relative error
• Each party Pj maintains a basic randomized wave for its stream consisting of d+1 queues, Qj(0),..,Qj(d), one for each level.
• A pseudo-random hash function h maps positions to levels according to an exponential distribution.
39
The Steps for Maintaining the Randomized Wave:
• Party Pj, upon receiving a stream bit b:
1. Increment pos (modulo N’).
2. Discard any position p in the tail of a queue that has expired (p<=pos-N).
3. If b=1 then for l=0,..,h(pos) do:
(a) If the level l queue Qj(l) is full, then discard the tail of Qj(l).
(b) Add pos to the head of Qj(l).
• The sample for each level, stored in a queue, contains the most recent positions selected into that level; the queue size depends on a constant c (c=36).
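One party's maintenance steps can be sketched as below. The hash is an assumption of this sketch: SHA-256 of a shared seed and the position stands in for the pseudo-random function h, with the level taken as the number of trailing zero bits so that a position reaches level l with probability 2^-l. The queue capacity is passed in as a plain parameter, and all names are made up for illustration.

```python
import hashlib
from collections import deque


def level_hash(pos, d, seed=0):
    """Map a position to a level in 0..d with Pr[level >= l] = 2**-l."""
    digest = hashlib.sha256(f"{seed}:{pos}".encode()).digest()
    value = int.from_bytes(digest[:8], "big")
    level = 0
    while level < d and (value & 1) == 0:   # count trailing zero bits, capped at d
        value >>= 1
        level += 1
    return level


class RandomizedWave:
    """One party's randomized wave: d + 1 queues of sampled 1-bit positions."""

    def __init__(self, window, capacity, seed=0):
        self.N = window
        self.d = (2 * window - 1).bit_length()  # log2 of smallest power of 2 >= 2N
        self.seed = seed                        # shared by all parties
        self.pos = 0
        # Left end of each deque is the tail (oldest), right end the head.
        self.queues = [deque(maxlen=capacity) for _ in range(self.d + 1)]

    def add(self, bit):
        self.pos += 1                                   # step 1
        for queue in self.queues:                       # step 2: drop expired tails
            while queue and queue[0] <= self.pos - self.N:
                queue.popleft()
        if bit:                                         # step 3
            top = level_hash(self.pos, self.d, self.seed)
            for l in range(top + 1):
                self.queues[l].append(self.pos)         # a full deque drops its tail
```

Because the level depends only on the seed and the position, every party that uses the same seed sends a given position to the same levels, which is what lets a referee take the union of the parties' samples.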
40
• Consider a queue Qj(l) whose tail is at position i. Then Qj(l) contains all the 1-bits in the interval [i,pos] whose positions hash to a value greater than or equal to l. The interval [i,pos] is the range of the queue.
• As we move from level l to l+1, the range may increase.
• The queues at lower numbered levels may have ranges that fail to contain the window, but as we move to higher levels, we will find a level whose range contains the window.
41
Answering a query for a sliding window of size n<=N
• After each party has observed pos bits:
1. Each party j sends its wave, {Qj(0),..,Qj(logN’)}, to the Referee. Let s=max(0,pos-n+1). Then W=[s,pos] is the desired window.
2. For j=1,..,t, let lj be the minimum level such that the tail of Qj(lj) is a position p<=s.
3. Let l*=max{lj}, j=1,..,t. Let U be the union of all positions in Q1(l*),..,Qt(l*).
4. Return the estimate 2^(l*) * |U ∩ W|, the number of sampled positions inside the window scaled by the inverse of the level-l* sampling probability.
42
• The algorithm returns an estimate for Union Counting Problem for any sliding window of size n<=N that is within a relative error E with probability greater than 2/3
• space -