CAP 6930: Approximate Query Processing
New Sampling-Based Summary Statistics for Improving Approximate Query Answers
By
Gibbons and Matias
Presented by: Abhijit Pol
Outline …
• A Framework For Approximate Query Answering
• Approximate Answers Using Samples
• Concise Samples
  • Definition
  • Algorithms
  • Evaluation
• Counting Samples
  • Definition
  • Algorithms
  • Evaluation
• Hot List Queries
A Framework for AQA

[Diagram: A Traditional System — new data is loaded into the Data Warehouse; queries run against the warehouse, which returns responses.]
[Diagram: System Set-up for AQA — new data is loaded into the Data Warehouse; an Approx. Answer Engine sits between the queries and the warehouse and returns approximate responses.]
What are the goals?
• Approximate but fast answers to queries, rather than accurate but slow answers.
  • The avg student grade with 95% confidence is 2.85 ± 0.2 (10 s)
  • The avg student grade is 2.9154444 (105 s)
• Orders of magnitude less time than the time to compute an exact answer.
• Save on costly disk I/Os.
Where does it fit?
• Scenarios in which an exact answer may not be required.
  • Drill-down query sequences in ad-hoc data mining
• Getting an idea of the answer before placing a full query load.
  • Provides feedback on how well the query is posed
• In the query optimizer, to estimate plan costs.
  • Its traditional role
How does it work?
• The engine maintains various summary statistics, referred to as synopsis data structures.
• Synopses can be maintained by:
  • Observing the new data as it is loaded into the DW
  • Periodically returning to the DW to update the information
  • Returning to the DW at query time
• The synopses are used to report an approximate answer, in either a continuous or a discrete manner.
How to evaluate it?
• Coverage: the range of queries supported
• Response Time: the time to provide an approximate answer
• Accuracy: the accuracy of the answer, and the confidence in that accuracy
• Update Time: the overhead of keeping synopses up to date
• Footprint: the storage requirement for the synopses
Approximate Answers Using Samples
• A synopsis can be:
  • Sample-based
  • Histogram-based
  • Sketch-based
• A sample-based synopsis can be maintained as follows:
  • Keep a counter for the number of tuples in the relation
  • If the relation has at most M tuples, store all of them
  • If the relation has more than M tuples, store a random sample of size M
What is the big idea in the paper?
• The goal is to develop effective synopses that capture important information about the data in a concise representation.
• The paper introduces two new sampling-based summary statistics: concise samples and counting samples.
• It also presents new techniques for their fast incremental maintenance.
Quick definitions
• Sample Size: if S is a sample of N data points, then the sample size |S| is simply the number of data points in S.
• Footprint: the footprint of a sample S is its storage requirement, in number of data points.
• For traditional sampling (e.g., reservoir sampling): Sample Size == Footprint
• For concise and counting sampling: Sample Size >= Footprint
Concise Samples
• Definition # 1: A concise sample is a uniform random sample of the data set such that values appearing more than once in the sample are represented as a <value, count> pair.
• Represent C copies of the same value V as a <V, C> pair, thereby freeing up space for C - 2 additional sample points!!
An Example

              Traditional Sample          Concise Sample
Sample        1 3 6 3 7 7 1 7 3 1 7 7    <1,3> <3,3> 6 <7,5>
Sample Size   12                          12
Footprint     12                          7

Can hold more samples for the same footprint!!
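The mapping in this example can be sketched in Python (a minimal illustration; the helper names are hypothetical): duplicates in a uniform sample collapse into (value, count) pairs, and the footprint charges one slot per singleton and two per pair.

```python
from collections import Counter

def to_concise(sample):
    """Represent a uniform random sample as a concise sample:
    values occurring more than once become (value, count) pairs,
    values occurring exactly once stay as singletons."""
    counts = Counter(sample)
    return [(v, c) if c > 1 else v for v, c in counts.items()]

def footprint(concise):
    """Each singleton costs one slot; each (value, count) pair costs two."""
    return sum(2 if isinstance(entry, tuple) else 1 for entry in concise)

sample = [1, 3, 6, 3, 7, 7, 1, 7, 3, 1, 7, 7]
concise = to_concise(sample)
print(concise)             # [(1, 3), (3, 3), 6, (7, 5)]
print(footprint(concise))  # 7, versus a traditional footprint of 12
```

The sample size is unchanged (12 points are still represented); only the storage drops.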
Definition # 2
• Let S = {<V1, C1>, …, <Vj, Cj>, V(j+1), …, Vl} be a concise sample. Then
  Sample Size(S) = (l - j) + Σ_{i = 1 .. j} Ci, and Footprint(S) = l + j
• For m/2 distinct values => the footprint is at most m
• Lemma # 1: For any footprint m >= 2, there exist data sets for which the sample-size of a concise sample is n/m times larger than its footprint, where n is the size of the data set.
Algorithm: Obtaining a sample
• Need a concise sample of footprint m from a relation R with n tuples residing on disk

For m times do: {
    Select a random tuple and extract R.A
}
Semi-sort the set of values to produce (value, count) pairs.
Sample until (all n data points are seen || footprint == m) {
    For each new value sampled, look up its existence in the current concise sample
    If present as a pair: increment the count for the pair
    If present as a singleton: convert the singleton to a (value, 2) pair
    Else: add a new singleton value
}
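A runnable sketch of this offline/static algorithm (assumptions: the attribute values fit in an in-memory list standing in for the disk-resident tuples, and the footprint is tracked incrementally rather than via an initial semi-sort):

```python
import random
from collections import Counter

def offline_concise_sample(values, m):
    """Draw random tuples one at a time, folding duplicates into
    (value, count) pairs, until the footprint reaches m or every
    data point has been seen."""
    n = len(values)
    counts = Counter()
    fp = 0      # footprint: #distinct values + #values stored as pairs
    seen = 0
    while seen < n and fp < m:
        v = random.choice(values)   # one random disk access per draw
        seen += 1
        if counts[v] == 0:
            fp += 1                 # new singleton costs one slot
        elif counts[v] == 1:
            fp += 1                 # singleton -> pair costs one more slot
        counts[v] += 1
    return [(v, c) if c > 1 else v for v, c in counts.items()]

# On skewed data the pairs free up space, so far more than m sample
# points fit within a footprint of m.
random.seed(42)
print(offline_concise_sample([1] * 90 + [2] * 10, m=4))
```

Each iteration costs one random disk access, which is where the Θ(sample-size) disk-access complexity on the next slide comes from.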
Algorithm: Obtaining a sample
• Referred to as the offline/static algorithm
• Complexity of the algorithm: Θ(sample-size) disk accesses
• An incremental (online) maintenance algorithm requires no disk accesses
• More generally, the approach can build a concise sample in one sequential pass over a relation
Algorithm: Incremental maintenance
• Maintain a concise sample within a given footprint bound as new data is inserted
• Can Vitter's reservoir-sampling idea be used to insert new data?
• The insertion problem is more difficult here:
  • There, the sample size is known in advance
  • In a concise sample, the sample size depends on the data distribution
  • Any change in the data distribution must be reflected in the sampling frequency
Algorithm: Incremental maintenance

Set entry threshold T = 1; let S be the current concise sample
For each new tuple t do: {
    With Pr(1/T) do: {
        Based on a look-up of t.A in S, either:
            create a singleton && footprint++ ||
            create a pair && footprint++ ||
            increment the counter of a current pair
    }
}
If footprint == pre-specified footprint then do: {
    Evict()
}
Algorithm: Incremental maintenance

Evict() {
    Raise the threshold to T' > T
    For each sample point in S do: {
        Flip a coin with Pr(T/T') of heads; on tails: {
            If the point is a singleton, evict it && footprint-- ||
            If the point is a <v,2> pair, make it a singleton && footprint-- ||
            If it is a pair, decrement the counter
        }
    }
    // Note E(|S|) = |S| * (T/T')
    If footprint == pre-specified footprint then do: {
        Evict()
    }
    // Note: subsequent inserts to S are done with Pr(1/T'), not Pr(1/T)
}
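Putting insert and Evict() together, a minimal runnable sketch (the class name, the dict representation, and the 10% raise factor are assumptions based on the slides, not the paper's exact code):

```python
import random

class ConciseSample:
    """Sketch of online concise-sample maintenance. The footprint is
    #distinct values + #values stored as pairs (count > 1)."""

    def __init__(self, max_footprint, raise_factor=1.1):
        self.max_footprint = max_footprint
        self.raise_factor = raise_factor  # ~10% raise per eviction round
        self.T = 1.0                      # entry threshold
        self.counts = {}
        self.footprint = 0

    def insert(self, value):
        # Each new tuple enters the sample with probability 1/T.
        if random.random() < 1.0 / self.T:
            c = self.counts.get(value, 0)
            if c <= 1:                    # new singleton, or singleton -> pair
                self.footprint += 1
            self.counts[value] = c + 1
            while self.footprint >= self.max_footprint:
                self._evict()

    def _evict(self):
        # Raise the threshold; each sampled point is retained with
        # probability T/T' and evicted otherwise.
        T_new = self.T * self.raise_factor
        keep = self.T / T_new
        for v in list(self.counts):
            c = self.counts[v]
            kept = sum(1 for _ in range(c) if random.random() < keep)
            if kept == c:
                continue
            if kept == 0:
                del self.counts[v]
                self.footprint -= 1       # the value disappears entirely
            else:
                self.counts[v] = kept
            if c > 1 and kept <= 1:
                self.footprint -= 1       # a pair shrinks to a singleton or less
        self.T = T_new
```

On skewed data this retains far more sample points than a traditional footprint-m reservoir sample, since each heavy value costs only two slots.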
Theorem # 2
• For any sequence of insertions, the above algorithm maintains a concise sample.
• Proof:
  • Let T be the current threshold
  • We maintain the invariant that each tuple in R has been treated with entry probability 1/T
  • The crux of the proof is to show that this invariant is maintained when T is raised to T'
  • When we do so, every sample point in S survives the eviction loop with Pr(T/T')
Theorem # 2
• Proof (continued): for each tuple t in R,
  • If t was NOT in S before eviction:
    • A coin with heads Pr(1/T) was flipped and failed to come up heads for t
    • The same probabilistic event would also fail to come up heads with the new, stricter coin (1/T' < 1/T)
  • If t was in S before eviction:
    • A coin with heads Pr(1/T) was flipped and showed heads, and t survives eviction with Pr(T/T')
    • 1/T * T/T' = 1/T', so the tuple is in the sample with probability 1/T'. Thus the inductive invariant is indeed maintained.
How much to raise the threshold?
• Large raise:
  • Evicts more than is needed => smaller sample-size
  • Evict() runs less frequently
• Small raise:
  • No problem with the sample-size
  • Greater likelihood that the footprint won't decrease, so Evict() runs more frequently because of repeated increases in T
• Typical (experimental) value: a 10% raise
Plugging in Vitter's idea
• Don't flip a coin for each insert; instead, flip once to determine how many inserts can be skipped before the next one is selected
• Let X be the event of skipping exactly i elements:
  Pr[X] = Pr[skip i elements in a row] * Pr[pick the (i+1)-th]
        = (1 - Pr[pick an element])^i * Pr[pick an element]
        = (1 - 1/T)^i * 1/T
• As T gets large, we save on the number of coin flips and hence the update time.
• Likewise, the probability of evicting a sample point (1 - T/T') is typically small, so we can save coin flips and decrease the update time for eviction as well.
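The skip length is geometric, so it can be drawn in one shot by inverting the geometric CDF (a sketch; `skip_count` is a hypothetical helper name). One random number then replaces up to i + 1 coin flips:

```python
import math
import random

def skip_count(T):
    """Number of inserts to skip before the next accepted one:
    Pr[skip exactly i] = (1 - 1/T)**i * (1/T). Drawn by inverse-transform
    sampling of the geometric distribution."""
    if T <= 1.0:
        return 0                  # at T = 1 every insert is accepted
    u = random.random()
    return int(math.log1p(-u) / math.log(1.0 - 1.0 / T))

random.seed(2)
# With a large threshold, skips are long and coin flips become rare.
print([skip_count(1000.0) for _ in range(5)])
```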
Online Algorithm: Analysis
• O(sample-size) coin flips before T is raised for the first time
• After that, T is raised by a constant factor each time, so we expect a constant number of coin tosses (resulting in retained sample points) for each sample point evicted.
• Thus we have O(1) amortized expected update time per insert, regardless of the data distribution.
Quantifying the SS advantage
• We know that for concise sampling, SS >= Footprint
• The expected SS increases with the skew of the data
• For an exponential distribution, the advantage is exponential!!
• Theorem # 3: Consider the family of exponential distributions: for i = 1, 2, …, Pr(v = i) = a^(-i) (a - 1), for a > 1. For any footprint m >= 2, the expected sample-size of a concise sample with footprint m is at least a^(m/2).
Quantifying the SS advantage
• Proof: "at least a^(m/2)" => we need a lower bound.
• The expected SS can be lower-bounded by the expected number of randomly selected tuples before the (m/2 + 1)-th distinct tuple value is selected.
• The probability of selecting a value > m/2 is:
  Σ_{i = m/2+1 .. ∞} a^(-i) (a - 1) = a^(-m/2)
• So the expected number of tuples selected before such an event occurs is a^(m/2).
Quantifying the SS advantage
• Expected gain over a traditional sample for arbitrary data sets.
• Frequency moments:
  Let A = {a1, a2, …, an}, where each ai is a member of N = {1, 2, …, n}
  Let m_i = |{j : a_j = i}| denote the number of occurrences of i in the sequence
  Then F_k = Σ_{i = 1 .. n} m_i^k
  F_0 = number of distinct elements in the sequence
  F_1 = length of the sequence
  F_2 = self-join size of the relation!!
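Frequency moments are straightforward to compute directly (a small illustration, reusing the sample values from the earlier concise-sample example):

```python
from collections import Counter

def frequency_moment(seq, k):
    """F_k = sum over distinct values i of m_i ** k, where m_i is the
    number of occurrences of i in the sequence."""
    return sum(c ** k for c in Counter(seq).values())

seq = [1, 3, 6, 3, 7, 7, 1, 7, 3, 1, 7, 7]
print(frequency_moment(seq, 0))  # 4  -> number of distinct elements
print(frequency_moment(seq, 1))  # 12 -> length of the sequence
print(frequency_moment(seq, 2))  # 44 -> self-join size
```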
Quantifying the SS advantage
• Theorem # 4: For any data set, when using a concise sample S with sample-size s, the expected gain is:
  E[s - #distinct values in S] = Σ_{k = 2 .. s} (-1)^k (s! / ((s - k)! k!)) F_k / n^k
• All we need to find is the expected number of distinct values in S.
Quantifying the SS advantage
• Proof: Define X_i to be the indicator random variable
  X_i = 1 if the i-th item selected for the traditional sample has a value not yet represented in the sample; X_i = 0 otherwise
• X = Σ_{i = 1 .. s} X_i = number of distinct values, so E[X] = Σ_{i = 1 .. s} E[X_i]
• Pr(X_i = 1) = Σ_j p_j (1 - p_j)^(i-1), where p_j = n_j / n is the probability that an item selected at random from the set has value j.
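The inner sum telescopes, Σ_{i = 1 .. s} p (1 - p)^(i-1) = 1 - (1 - p)^s, so the expected number of distinct values can be evaluated directly (a sketch; the function name is hypothetical):

```python
from collections import Counter

def expected_distinct(data, s):
    """Expected number of distinct values in a size-s sample drawn with
    replacement: E[X] = sum_j (1 - (1 - p_j) ** s), with p_j = n_j / n."""
    n = len(data)
    return sum(1 - (1 - c / n) ** s for c in Counter(data).values())

# Two equally likely values, sample size 2: E[#distinct] = 1.5.
print(expected_distinct([1, 2], 2))  # 1.5
```

The expected gain of a concise sample is then s minus this quantity.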
Experimental Evaluation
• Experiments evaluate the gain in sample-size of concise samples over traditional samples
• 500K new values were inserted, with D potential distinct values (varied from 500 to 50K)
• m values of 100 and 1000 were used
• Three algorithms were compared: traditional reservoir sampling, the concise online algorithm, and the concise offline/static algorithm
• A large variety of Zipf data distributions was used, with the Zipf parameter varied from 0 to 3 in increments of 0.25
What is the Zipf distribution?
• Some applications of the exponential distribution are known under the name of Zipf's Law
• It is a discrete distribution embodying an idea very similar to the 80-20 rule
• P(x) ∝ x^(-a), where a is the Zipf parameter
• a = 0 => uniform distribution; the larger a is, the more skewed the data
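A quick way to see the skew (an illustrative sketch; the value range and draw count are arbitrary choices, not the paper's experimental setup):

```python
import random

def zipf_weights(D, a):
    """Unnormalized Zipf weights P(x) ∝ x ** (-a) over values 1..D;
    a = 0 gives the uniform distribution."""
    return [x ** -a for x in range(1, D + 1)]

random.seed(0)
# The larger the Zipf parameter a, the fewer distinct values
# dominate the drawn data.
for a in (0.0, 1.5, 3.0):
    data = random.choices(range(1, 501), weights=zipf_weights(500, a), k=10000)
    print(a, "->", len(set(data)), "distinct values")
```

Fewer dominant distinct values is exactly the regime where pairs pay off, which is why the sample-size gain grows with a in the plots that follow.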
Experimental Evaluation
[Plots of sample-size gain: (a) m = 100 and (b) m = 1000]
Experimental Evaluation
[Plots of sample-size gain: (c) D/m = 50 and (d) D/m = 5]
Experimental Evaluation
• Update-time overheads:
  • Coin flips for inserts and evictions
  • Look-ups into the current concise sample
Counting Samples
• Counting samples are a variation on concise samples.
• They keep track of all occurrences of a value inserted into the relation since the value was selected for the sample.
• Definition # 3: A counting sample for R.A with threshold T is any subset of R.A obtained as follows:
  1. For each value v occurring c > 0 times in R, we flip a coin with probability 1/T of heads until the first heads, up to at most c coin tosses in all; if the i-th coin toss is heads then v occurs c - i + 1 times in the subset, else v is not in the subset.
Counting Samples
  2. Each value v occurring c > 1 times in the subset is represented as a pair <v, c>, and each value v occurring exactly once is represented as a singleton v.
• Although counting samples are not uniform random samples of the base data, they can be used to obtain such a sample without any further access to the base data.
• A concise sample can be obtained from a counting sample by considering each pair <v, c> in the counting sample in turn, flipping a coin with probability 1/T of heads c - 1 times, and reducing the count by the number of tails.
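That conversion can be sketched as follows (a minimal illustration; the function name and the list-of-pairs representation are assumptions): the first occurrence of each stored value already passed a Pr(1/T) coin flip by construction, and each of the remaining c - 1 occurrences is re-flipped.

```python
import random

def counting_to_concise(counting_sample, T):
    """Convert a counting sample with threshold T into a concise sample:
    for each pair (v, c), flip c - 1 coins with heads probability 1/T and
    reduce the count by the number of tails."""
    concise = []
    for entry in counting_sample:
        v, c = entry if isinstance(entry, tuple) else (entry, 1)
        # The first occurrence stays; each extra occurrence survives a flip.
        kept = 1 + sum(1 for _ in range(c - 1) if random.random() < 1.0 / T)
        concise.append((v, kept) if kept > 1 else v)
    return concise

# With T = 1 every coin comes up heads, so the sample is unchanged.
print(counting_to_concise([("a", 5), "b"], T=1.0))  # [('a', 5), 'b']
```

No access to the base data is needed; everything is computed from the stored pairs.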