Efficient Computation of Frequent and Top-k Elements in Data Streams
Ahmed Metwally
Divyakant Agrawal
Amr El Abbadi
Department of Computer Science
University of California, Santa Barbara

Motivation
Motivated by Internet advertising commissioners.
Before rendering an advertisement for a user, query the click stream for advertisements to display.
– If the user's profile is not a frequent “clicker”, then s/he will probably not click any displayed advertisement.
  • Show Pay-Per-Impression advertisements.
– If the user's profile is a frequent “clicker”, then s/he may click a displayed advertisement.
  • Show Pay-Per-Click advertisements.
  • Retrieve top advertisements to choose what to display.

Problem Definition
Given an alphabet A and a stream S of size N, a frequent element, E, is an element whose frequency, F, exceeds a user-specified support, φN.
The top-k elements are the k elements with the highest frequency.
Both problems:
– Are closely related, though no integrated solution has been proposed
– Need O(min(N, |A|)) space for an exact solution, which motivates approximate variations
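As a point of reference, the exact versions of both queries can be sketched with one counter per distinct element, which is where the O(min(N, |A|)) space comes from. This is a minimal sketch; the function names `exact_frequent` and `exact_top_k` are hypothetical and do not come from the slides.

```python
from collections import Counter


def exact_frequent(stream, phi):
    """Return every element whose frequency F exceeds the support phi * N."""
    counts = Counter(stream)          # one counter per distinct element
    n = sum(counts.values())          # stream size N
    return {e for e, f in counts.items() if f > phi * n}


def exact_top_k(stream, k):
    """Return the k elements with the highest frequency, most frequent first."""
    return [e for e, _ in Counter(stream).most_common(k)]


if __name__ == "__main__":
    s = "ABBACABBDDBEC"               # N = 13, so phi * N = 2.6 for phi = 0.2
    print(sorted(exact_frequent(s, 0.2)))   # ['A', 'B']
    print(exact_top_k(s, 2))                # ['B', 'A']
```

Both functions keep a counter for every distinct element seen, which is the cost the approximate algorithms in the rest of the talk try to avoid.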

Practical Frequent Elements
ε-Deficient Frequent Elements [Manku ‘02]:
– All frequent elements output should have F > (φ - ε)N, where ε is the user-defined error.
[Figure: diagram marking the thresholds φN and (φ - ε)N]

Practical Top-k
FindApproxTop(S, k, ε) [Charikar ‘02]:
– Retrieve a list of k elements such that every element, Ei, in the list has Fi > (1 - ε)Fk, where Ek is the kth ranked element.
[Figure: diagram marking the thresholds F4 and (1 - ε)F4]

Related Work
Algorithms Classification
– Counter-Based techniques
  • Keep an individual counter for each element
  • If the observed ID is monitored, its counter is updated
  • If the observed ID is not monitored, algorithm dependent action
– Sketch-Based techniques
  • Estimate frequency for all elements using bit-maps of counters
  • Each element is hashed into the counters’ space using a family of hash functions
  • Hashed-to counters are queried for the frequencies

Recent Work (Comparison)
Algorithm | Nature | Space Bound | Handles
CountSketch [Charikar ‘02] | Sketch | O(k/ε² log N/δ), δ is the failure probability | FindApproxTop(S, k, ε)
GroupTest [Cormode ’03] | Sketch | O(φ⁻¹ log(φ⁻¹) log(|A|)) | Hot Items
Frequent [Demaine ’02] | Counter | O(1/ε), proved by [Bose ‘03] | FE
Probabilistic-Inplace [Demaine ’02] | Counter | O(m), m is the available memory | FindCandidateTop(S, k, m/2)
Lossy Counting [Manku ’02] | Counter | (1/ε) log(εN) | ε-Deficient FE
Sticky Sampling [Manku ’02] | Counter | (2/ε) log(φ⁻¹δ⁻¹) | ε-Deficient FE

Outline
Problem Definition
Space-Saving: Summarizing the Data Stream
Answering Frequent Elements Queries
Answering Top-k Queries
Experimental Results
Conclusion

The Space-Saving Algorithm
Space-Saving is counter-based
Monitor only m elements
Only over-estimation errors
Frequency estimation is more accurate for significant elements
Keep track of max. possible errors

Space-Saving By Example
Stream S = A B B A C A B B D D B E C, monitored with m = 3 counters.
After A B B A C:
Element               A  B  C
Count                 2  2  1
error (max possible)  0  0  0
After A:
Element               A  B  C
Count                 3  2  1
error (max possible)  0  0  0
After B B:
Element               B  A  C
Count                 4  3  1
error (max possible)  0  0  0
After D (replaces C, the element with minimum hits):
Element               B  A  D
Count                 4  3  2
error (max possible)  0  0  1
After D B:
Element               B  A  D
Count                 5  3  3
error (max possible)  0  0  1
After E (replaces D):
Element               B  E  A
Count                 5  4  3
error (max possible)  0  3  0
After C (replaces A):
Element               B  E  C
Count                 5  4  4
error (max possible)  0  3  3
Space-Saving Algorithm
– For every element in the stream S
  – If a monitored element is observed
    • Increment its Count
  – If a non-monitored element is observed
    • Replace the element with minimum hits, min
    • Increment the minimum Count to min + 1
    • Record min as the error, the maximum possible over-estimation
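The steps above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `space_saving` is hypothetical, the minimum is found by a linear scan instead of a constant-time structure, and ties among minimum elements are broken by dictionary order, which is an assumption of this sketch.

```python
def space_saving(stream, m):
    """Process `stream` with at most m counters.

    Returns {element: [Count, error]} for the monitored elements.
    """
    counters = {}
    for x in stream:
        if x in counters:
            # Monitored element: increment its Count.
            counters[x][0] += 1
        elif len(counters) < m:
            # A free counter is still available: start monitoring x.
            counters[x] = [1, 0]
        else:
            # Replace an element with minimum hits (linear scan, O(m)).
            victim = min(counters, key=lambda e: counters[e][0])
            min_count = counters.pop(victim)[0]
            # New Count is min + 1; error records the possible over-estimation.
            counters[x] = [min_count + 1, min_count]
    return counters


if __name__ == "__main__":
    summary = space_saving("ABBACABBDDBEC", m=3)
    for element, (count, error) in sorted(summary.items(), key=lambda kv: -kv[1][0]):
        print(element, count, error)
    # B 5 0
    # E 4 3
    # C 4 3
```

Because of the tie-breaking assumption, this sketch evicts A before D when E arrives, whereas the slides evict D first; the final summary is the same either way.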

Space-Saving Observations
S = ABBACABBDDBEC, N = 13
Element               B  E  C
Count                 5  4  4
error (max possible)  0  3  3
Observations:
– The summation of the Counts is N (here, 5 + 4 + 4 = 13)
– Minimum number of hits, min ≤ N/m
  • In this example, min = 4 and N/m = 13/3
– The minimum number of hits, min, is an upper bound on the error of any element
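The bound min ≤ N/m follows from the first observation once all m counters are in use. A short derivation, writing Count_i for the i-th counter (a notation assumed here, not taken from the slides):

```latex
\sum_{i=1}^{m} \mathrm{Count}_i = N
\quad\text{and}\quad
\mathrm{Count}_i \ge \mathit{min} \ \text{ for all } i
\;\Longrightarrow\;
m \cdot \mathit{min} \;\le\; \sum_{i=1}^{m} \mathrm{Count}_i \;=\; N
\;\Longrightarrow\;
\mathit{min} \;\le\; \frac{N}{m}
```

In the example, 3 · 4 = 12 ≤ 13.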

Space-Saving Proved Properties
S = ABBACABBDDBEC, N = 13
Element               B  E  C
Count                 5  4  4
error (max possible)  0  3  3
1. If Element E has frequency F > min, then E must be in Stream-Summary. Example: F(B) = F1 = 5 > min = 4, and B is monitored.
2. The Count at position i in Stream-Summary is no less than Fi, the frequency of the ith ranked element. Example: F(A) = F2 = 3, and Count2 = 4.

Space-Saving Data Structure
We need a data structure that
– Increments counters in constant time
– Keeps elements sorted by their counters
We propose the Stream-Summary structure, similar to the data structure in [Demaine ’02]
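One way to approach both requirements can be sketched as follows. This is not the Stream-Summary structure itself, whose internals the slide does not give; it is a simplified stand-in that groups elements into buckets keyed by Count, so an increment moves an element to the neighbouring bucket and the minimum Count is tracked in constant time. The class name `BucketedSummary` and its methods are hypothetical, the sorted order is produced at query time by sorting the bucket keys rather than maintained continuously, and m ≥ 1 is assumed.

```python
from collections import defaultdict


class BucketedSummary:
    """Space-Saving counters grouped into buckets of equal Count (a sketch)."""

    def __init__(self, m):
        self.m = m
        self.count = {}                   # element -> Count
        self.error = {}                   # element -> max possible over-estimation
        self.buckets = defaultdict(dict)  # Count -> elements (dict as ordered set)
        self.min_count = 0                # smallest Count among monitored elements

    def _place(self, element, new_count):
        self.buckets[new_count][element] = None
        self.count[element] = new_count

    def _unplace(self, element):
        old = self.count.pop(element)
        bucket = self.buckets[old]
        del bucket[element]
        if not bucket:                    # drop empty buckets
            del self.buckets[old]
        return old

    def observe(self, element):
        if element in self.count:
            old = self._unplace(element)  # monitored: Count goes old -> old + 1
            self._place(element, old + 1)
        elif len(self.count) < self.m:
            self.error[element] = 0       # free counter: start at Count 1
            self._place(element, 1)
            self.min_count = 1
            return
        else:
            # Evict some element from the minimum bucket.
            victim = next(iter(self.buckets[self.min_count]))
            old = self._unplace(victim)
            del self.error[victim]
            self.error[element] = old
            self._place(element, old + 1)
        # One element moved from `old` to `old + 1`; if the minimum bucket
        # emptied, every monitored Count is now at least old + 1.
        if old == self.min_count and old not in self.buckets:
            self.min_count = old + 1

    def ranked(self):
        """Return (element, Count, error) triples, highest Count first."""
        out = []
        for c in sorted(self.buckets, reverse=True):
            out.extend((e, c, self.error[e]) for e in self.buckets[c])
        return out


if __name__ == "__main__":
    summary = BucketedSummary(m=3)
    for x in "ABBACABBDDBEC":
        summary.observe(x)
    print(summary.ranked())   # [('B', 5, 0), ('E', 4, 3), ('C', 4, 3)]
```

Each `observe` call touches a constant number of dictionary entries, while `ranked` pays for a sort over the distinct Counts; a linked list of buckets would remove that sort at the cost of more bookkeeping.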

Frequent Elements Queries
Traverse Stream-Summary, and report all elements that satisfy the user support.
Any element whose guaranteed hits = (Count – error) > φN is guaranteed to be a frequent element.

Frequent Elements Example
For N = 73, m = 8, φ = 0.15:
Element                          B   D   G   A   Q   F   C   E
Count                            20  14  12  9   7   5   3   3
error                            1   0   4   1   3   0   1   2
Guaranteed Hits = Count - error  19  14  8   8   4   5   2   1
– Frequent Elements should have a support of at least 11 hits, since φN = 10.95.
– Candidate Frequent Elements are B, D, and G, whose Counts exceed φN.
– Guaranteed Frequent Elements are B and D, since their guaranteed hits (19 and 14) exceed φN.
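The query in this example can be sketched as a single pass over the summary. The function name `frequent_elements` and the list-of-tuples layout are assumptions made for illustration; the numbers are the ones from the table above.

```python
def frequent_elements(summary, n, phi):
    """Split monitored elements into candidates and guaranteed frequent elements.

    `summary` is a list of (element, Count, error) triples.
    """
    threshold = phi * n
    # Candidates: the over-estimated Count exceeds the support.
    candidates = [e for e, count, error in summary if count > threshold]
    # Guaranteed: even the guaranteed hits (Count - error) exceed the support.
    guaranteed = [e for e, count, error in summary if count - error > threshold]
    return candidates, guaranteed


if __name__ == "__main__":
    summary = [("B", 20, 1), ("D", 14, 0), ("G", 12, 4), ("A", 9, 1),
               ("Q", 7, 3), ("F", 5, 0), ("C", 3, 1), ("E", 3, 2)]
    candidates, guaranteed = frequent_elements(summary, n=73, phi=0.15)
    print(candidates)   # ['B', 'D', 'G']
    print(guaranteed)   # ['B', 'D']
```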

Frequent Elements Space Bounds
Space Bounds | General Distribution | Zipf(α)
Space-Saving | O(1/ε) | (1/ε)^(1/α)
GroupTest | O(φ⁻¹ log(φ⁻¹) log(|A|)) |
Frequent | O(1/ε), proved by [Bose ’03] |
Lossy Counting | (1/ε) log(εN) |
Sticky Sampling | (2/ε) log(φ⁻¹δ⁻¹) |

Top-k Elements Queries
Traverse the Stream-Summary, and report top-k elements.
From Property 2, we assert:
– Guaranteed top-k elements:
  • Any element whose guaranteed hits = (Count – error) ≥ Count_{k+1}, the Count at position k+1, is guaranteed to be in the top-k.
– Guaranteed top-k’ (where k’ ≈ k):
  • The top-k’ elements reported are guaranteed to be the correct top-k’ iff for every element in the top-k’, guaranteed hits = (Count – error) ≥ Count_{k’+1}.

Top-k Elements Example
For k = 3, m = 8:
Element                          B   D   G   A   Q   F   C   E
Count                            20  14  12  9   7   5   3   3
error                            1   0   4   1   3   0   1   2
Guaranteed Hits = Count - error  19  14  8   8   4   5   2   1
– B, D, and G are the top-3 candidates.
– B and D are guaranteed to be in the top-3, since their guaranteed hits (19 and 14) are at least the 4th Count, 9; G's guaranteed hits (8) are not.
– B, D, G, and A are guaranteed to be the top-4, since each has guaranteed hits of at least the 5th Count, 7. Here k’ = 4.
– B and D are guaranteed to be the top-2, since each has guaranteed hits of at least the 3rd Count, 12. Another k’ = 2.
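The checks in this example can be sketched directly from the two guarantees. The function names are hypothetical, the summary is assumed to be a list of (element, Count, error) triples sorted by Count in descending order, and only k < m is handled because the test needs the Count at position k + 1. For simplicity the first function only examines the first k positions, as the example does.

```python
def guaranteed_in_top_k(summary, k):
    """Elements in the first k positions whose guaranteed hits reach the
    Count at position k + 1 (requires k < len(summary))."""
    next_count = summary[k][1]   # position k + 1 is index k in a 0-based list
    return [e for e, count, error in summary[:k] if count - error >= next_count]


def guaranteed_prefixes(summary):
    """All k' < m for which the first k' elements are guaranteed to be the top-k'."""
    return [k for k in range(1, len(summary))
            if len(guaranteed_in_top_k(summary, k)) == k]


if __name__ == "__main__":
    summary = [("B", 20, 1), ("D", 14, 0), ("G", 12, 4), ("A", 9, 1),
               ("Q", 7, 3), ("F", 5, 0), ("C", 3, 1), ("E", 3, 2)]
    print(guaranteed_in_top_k(summary, 3))   # ['B', 'D']
    print(guaranteed_prefixes(summary))      # [1, 2, 4, 6]
```

The second result includes the two values of k’ named on the slide (2 and 4); under the same rule, k’ = 1 and k’ = 6 also pass for this table.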

Top-k Elements Space Bounds
Space Bounds | General Distribution | Zipf(α)
Space-Saving | FindApproxTop(S, k, ε): O(k/ε * log(N)) | Exact Top-k Problem: α = 1: O(k² log(|A|)); α > 1: O((k/α)^(1/α) k)
CountSketch | FindApproxTop(S, k, ε): O(k/ε² * log(N/δ)) | FindApproxTop(S, k, ε): α ≥ 1: O(k * log(N/δ))

Experimental Results - Setup
Synthetic data:
– Zipf(α), α varied: 0.0, 0.5, 1.0, …, 2.5, 3.0
– N = 10⁷ hits
Real Data (ValueClick, Inc.): Similar results
Precision:
– number of correct elements found / entire output
Recall:
– number of correct elements found / number of actual correct
Run time:
– Processing Stream + Query Time
Space used:
– Including hash table
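The precision and recall definitions above can be written out as a small helper. The name `precision_recall` is hypothetical, and returning 1.0 when a set is empty is a convention assumed here, not something the slides specify.

```python
def precision_recall(reported, correct):
    """Precision = correct found / entire output; recall = correct found / actual correct."""
    reported, correct = set(reported), set(correct)
    found = len(reported & correct)            # number of correct elements found
    precision = found / len(reported) if reported else 1.0
    recall = found / len(correct) if correct else 1.0
    return precision, recall


if __name__ == "__main__":
    # Illustrative sets only: three elements reported, two of them actually correct.
    p, r = precision_recall(["B", "D", "G"], ["B", "D"])
    print(round(p, 3), r)   # 0.667 1.0
```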

Frequent Elements Results
Query: φ = 10⁻², ε = 10⁻⁴, and δ = 10⁻²
We compared with
– GroupTest and Frequent
All algorithms had a recall of 1.
– That is, they all output the correct elements among their output.
Space-Saving was able to guarantee all its output to be correct.

Frequent Elements Precision
[Figure: Precision for Frequent Elements (>100,000 Hits) on Synthetic Data. x-axis: Zipf Alpha (0 to 3); y-axis: Precision (0 to 1); series: Space-Saving, GroupTest, Frequent.]

Frequent Elements Run Time
[Figure: Run Time for Frequent Elements (>100,000 Hits) on Synthetic Data. x-axis: Zipf Alpha (0 to 3); y-axis: Run Time (ms); series: Space-Saving, GroupTest, Frequent.]

Frequent Elements Space Used
[Figure: Space Used for Frequent Elements (>100,000 Hits) on Synthetic Data. x-axis: Zipf Alpha (0 to 3); y-axis: Space Used (Bytes); series: Space-Saving, GroupTest, Frequent.]

Top-k Elements Results
Query: k = 100, ε = 10⁻⁴, and δ = 10⁻²
We compared with
– CountSketch: CountSketch was re-run several times. The hidden constant was estimated to be 16, in order to have output of competitive quality.
– Probabilistic-InPlace: was allowed the same number of counters as Space-Saving
Space-Saving was able to guarantee all its output to be correct.

Top-k Elements Precision
[Figure: Precision for Top-100 on Synthetic Data. x-axis: Zipf Alpha (0 to 3); y-axis: Precision (0 to 1); series: Space-Saving, CountSketch, Probabilistic InPlace.]

Top-k Elements Recall
[Figure: Recall for Top-100 on Synthetic Data. x-axis: Zipf Alpha (0 to 3); y-axis: Recall (0 to 1); series: Space-Saving, CountSketch, Probabilistic InPlace.]

Top-k Elements Run Time
[Figure: Run Time for Top-100 on Synthetic Data. x-axis: Zipf Alpha (0 to 3); y-axis: Run Time (ms); series: Space-Saving, CountSketch, Probabilistic InPlace.]

Top-k Elements Space Used
[Figure: Space Used for Top-100 on Synthetic Data. x-axis: Zipf Alpha (0 to 3); y-axis: Space Used (Bytes); series: Space-Saving, CountSketch, Probabilistic InPlace.]

Conclusion
Contributions:
– An integrated approach to solve an interesting family of problems
– Strict error bounds using little space
– Guarantees on results
– Special attention was given to Zipfian data
– Experimental validation
Future Work:
– Incremental frequent and top-k elements reporting