Approximate Query Processing using Wavelets
Kaushik Chakrabarti (Univ. of Illinois), Minos Garofalakis (Bell Labs), Rajeev Rastogi (Bell Labs), Kyuseok Shim (KAIST and AITrc)
Presented at the 26th VLDB Conference, Cairo, Egypt
Presented by Supriya Sudheendra
Outline
Introduction
o Approximate query processing is a viable solution for:
  - Huge amounts of data
  - High query complexity
  - Stringent response-time requirements
o Decision Support Systems (DSS)
  - Support business and organizational decision-making activities
  - Help decision makers compile useful information from raw data, solve problems, and make decisions
Introduction…
o DSS users pose very complex queries to the DBMS
  - These require complex operations over gigabytes or terabytes of disk-resident data
  - Exact answers take a very long time to execute and produce
  - In a number of scenarios, users prefer a fast, approximate answer
Prior Work
o Previous approximate query processing techniques
  - Focused on specific forms of aggregate queries
  - Data reduction mechanism: how to obtain the synopses of data
o Sampling-based techniques
  - A join operator on two uniform random samples results in a non-uniform sample having very few tuples
  - For non-aggregate queries, sampling produces a small subset of the exact answer, which might be empty when joins are involved
Prior Work…
o Histogram-based techniques
  - Problematic for high-dimensional data
  - Storage overhead
  - High construction cost
o Wavelet-based techniques
  - Mathematical tool for hierarchical decomposition of functions
  - Apply wavelet decomposition to the input data collection to obtain a data synopsis
  - Avoids high construction costs and storage overhead
Contribution of the Paper
o Viability and effectiveness of wavelets as a generic tool for high-dimensional DSS
o New, I/O-efficient wavelet decomposition algorithm for relational tables
o Novel query processing algebra for wavelet-coefficient data synopses
o Extensive experiments
Background
o Wavelets: a mathematical tool to hierarchically decompose functions
o Coarse overall approximation together with detail coefficients that influence the function at various scales
o Haar wavelets are conceptually simple and fast to compute
o Variety of applications, such as image editing and querying
One-Dimensional Haar Wavelets
o How to compute, given a data array:
  - Average the values together pairwise to get a "lower-resolution" representation of the data
  - Detail coefficients: the differences of the averaged values from the computed pairwise average
  - Why detail coefficients? They make reconstruction of the original data array possible
One-dimensional Haar Wavelets
o Wavelet transform: the overall average followed by the detail coefficients in increasing order of resolution. Each entry is a wavelet coefficient
o WA = [4, -2, 0, -1]
o For vectors containing similar values, most detail coefficients have small values that can be eliminated
  - Introduces only small errors
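The averaging and differencing steps can be sketched in Python. This is a minimal illustration rather than the authors' code. The names `haar_decompose` and `haar_reconstruct` are hypothetical, and the detail coefficient of a pair (a, b) is assumed to be (a - b)/2. The input [2, 2, 5, 7] is not from the slides; it is chosen here only because it yields the WA shown above.

```python
def haar_decompose(data):
    """Unnormalized 1-D Haar transform of a list whose length is a power of two.

    Returns [overall average, detail coefficients in increasing resolution].
    """
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    current = [float(x) for x in data]
    details = []  # coarsest level first
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))
        averages = [(a + b) / 2 for a, b in pairs]
        level_details = [(a - b) / 2 for a, b in pairs]  # assumed convention
        details = level_details + details  # coarser levels go in front
        current = averages
    return current + details


def haar_reconstruct(coeffs):
    """Invert haar_decompose: each (average, detail) pair expands to two values."""
    current = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        level = coeffs[pos:pos + len(current)]
        expanded = []
        for avg, det in zip(current, level):
            expanded.extend([avg + det, avg - det])
        pos += len(current)
        current = expanded
    return current


wa = haar_decompose([2, 2, 5, 7])   # illustrative input, not from the slides
original = haar_reconstruct(wa)
```

Because every detail coefficient is kept, `haar_reconstruct` recovers the input exactly. Setting a small detail coefficient to zero before reconstruction changes only the cells under that coefficient, and only by its magnitude.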
One-dimensional Haar Wavelets
o The overall average is more important than any detail coefficient
o To normalize the final entries of WA, each wavelet coefficient is divided by $\sqrt{2^l}$
  - l: level of resolution (l = 0 is the coarsest level)
  - WA = $[4, -2, 0, -1/\sqrt{2}]$
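A small Python sketch of the normalization step follows. It assumes the usual Haar normalization, in which a coefficient at resolution level l is divided by sqrt(2^l), with l = 0 for the overall average and the coarsest detail coefficient. The function names `normalize` and `largest_coefficients` are hypothetical. Ranking coefficients by normalized magnitude is shown only as one plausible way to decide which ones to retain.

```python
import math


def normalize(coeffs):
    """Divide each coefficient by sqrt(2**l), where l is its resolution level.

    Layout assumed: index 0 is the overall average, index 1 is level 0,
    indices 2-3 are level 1, indices 4-7 are level 2, and so on.
    """
    out = [float(coeffs[0])]
    for i in range(1, len(coeffs)):
        level = i.bit_length() - 1  # floor(log2(i)) without floating point
        out.append(coeffs[i] / math.sqrt(2 ** level))
    return out


def largest_coefficients(coeffs, b):
    """Indices of the b coefficients with the largest normalized magnitude."""
    norm = normalize(coeffs)
    order = sorted(range(len(norm)), key=lambda i: abs(norm[i]), reverse=True)
    return sorted(order[:b])


normalized = normalize([4, -2, 0, -1])
kept = largest_coefficients([4, -2, 0, -1], 2)
```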
Multi-dimensional Haar Wavelets
o Haar wavelets can be extended to multi-dimensional arrays
o Standard decomposition
  - Fix an ordering for the data dimensions (1, 2, …, d)
  - Apply the complete 1-D wavelet transform to each 1-D row of array cells along dimension k
o Nonstandard decomposition
  - Alternates between dimensions during successive steps of pairwise averaging and differencing, performing one step for each 1-D row of array cells along dimension k
  - Repeated recursively on the quadrant containing all averages across all dimensions
Non-standard Decomposition
o Pairwise averaging and differencing for one positioning of the 2x2 box with root $[2i_1, 2i_2]$
o Distribution of the results in the wavelet transform array
o Process is recursed on the lower-left quadrant of WA
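The nonstandard decomposition of a square 2-D array can be sketched as below. This is an illustration under stated assumptions, not the paper's algorithm: the function name `nonstandard_haar_2d` is hypothetical, the array side is assumed to be a power of two, and the placement of the three detail coefficients in the off-average quadrants is one plausible layout. Averages are written to the quadrant with the lowest indices, which corresponds to the lower-left quadrant when the origin is drawn at the bottom left.

```python
def nonstandard_haar_2d(array):
    """Nonstandard 2-D Haar decomposition of a square, power-of-two array."""
    n = len(array)
    if n == 0 or n & (n - 1) or any(len(row) != n for row in array):
        raise ValueError("expected a square array with power-of-two side")
    w = [[float(x) for x in row] for row in array]
    size = n
    while size > 1:
        half = size // 2
        step = [[0.0] * size for _ in range(size)]
        for i1 in range(half):
            for i2 in range(half):
                # The 2x2 box rooted at [2*i1, 2*i2].
                a = w[2 * i1][2 * i2]
                b = w[2 * i1][2 * i2 + 1]
                c = w[2 * i1 + 1][2 * i2]
                d = w[2 * i1 + 1][2 * i2 + 1]
                step[i1][i2] = (a + b + c + d) / 4                # average
                step[i1][half + i2] = (a - b + c - d) / 4         # detail along 2nd index
                step[half + i1][i2] = (a + b - c - d) / 4         # detail along 1st index
                step[half + i1][half + i2] = (a - b - c + d) / 4  # diagonal detail
        for r in range(size):
            w[r][:size] = step[r]
        size = half  # recurse on the quadrant that holds the averages
    return w


small = nonstandard_haar_2d([[1, 2], [3, 4]])
```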
Example Decomposition of a 4 X 4 Array
Multi-dimensional Haar Coefficients: Semantics and Representation
o The d-dimensional Haar basis function corresponding to wavelet coefficient W is defined by:
  - A d-dimensional rectangular support region
  - Quadrant sign information
Support Regions for the 16 Nonstandard 2-D Haar Basis Functions
o Blank areas: regions of A whose reconstruction is independent of the coefficient
o WA[0,0]: overall average
o WA[3,3]: contributes only to the upper-right quadrant
Haar Coefficients: Semantics and Representation
o W = <R, S, v>
o W.R: d-dimensional support hyper-rectangle of W; encloses all cells in A to which W contributes
  - The hyper-rectangle is represented by low and high boundaries across each dimension j, 1 <= j <= d: W.R.boundary[j].lo and W.R.boundary[j].hi
  - W contributes to each data cell $A[i_1, \ldots, i_d]$ where W.R.boundary[j].lo <= $i_j$ <= W.R.boundary[j].hi for all j
o W.S: sign information for all d-dimensional quadrants of W.R
  - Denoted by W.S.sign[j].lo and W.S.sign[j].hi, corresponding to the lower and upper half of W.R's extent along dimension j
  - The sign of a quadrant is computed as the product of the d sign-vector entries that map to that quadrant
o W.v: scalar magnitude of W
  - The quantity that W contributes to all data array cells enclosed in W.R
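The <R, S, v> representation maps naturally onto a small record type. The sketch below is hypothetical: the class name `Coefficient`, its field names, and the helpers `contribution` and `reconstruct_cell` are illustrative, boundaries are assumed inclusive, and the lower half of an extent is assumed to end at the integer midpoint. The example coefficients encode the 1-D transform WA = [4, -2, 0, -1] from the earlier slide.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Coefficient:
    lo: Tuple[int, ...]       # W.R.boundary[j].lo for each dimension j
    hi: Tuple[int, ...]       # W.R.boundary[j].hi (inclusive, by assumption)
    sign_lo: Tuple[int, ...]  # W.S.sign[j].lo, +1 or -1
    sign_hi: Tuple[int, ...]  # W.S.sign[j].hi, +1 or -1
    v: float                  # W.v


def contribution(w, cell):
    """Signed amount that coefficient w adds to the given cell (0 if outside W.R)."""
    sign = 1
    for j, i_j in enumerate(cell):
        if not (w.lo[j] <= i_j <= w.hi[j]):
            return 0.0
        mid = (w.lo[j] + w.hi[j]) // 2
        # The quadrant sign is the product of the per-dimension sign entries.
        sign *= w.sign_lo[j] if i_j <= mid else w.sign_hi[j]
    return sign * w.v


def reconstruct_cell(coeffs, cell):
    """A cell's value is the sum of the contributions of all coefficients."""
    return sum(contribution(w, cell) for w in coeffs)


coeffs_1d = [
    Coefficient((0,), (3,), (1,), (1,), 4.0),    # overall average
    Coefficient((0,), (3,), (1,), (-1,), -2.0),  # coarsest detail
    Coefficient((0,), (1,), (1,), (-1,), 0.0),   # detail over cells 0-1
    Coefficient((2,), (3,), (1,), (-1,), -1.0),  # detail over cells 2-3
]
cells = [reconstruct_cell(coeffs_1d, (i,)) for i in range(4)]
```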
Building Wavelet Coefficient Synopses
o Relation R with d attributes $X_1, X_2, \ldots, X_d$
o R can be represented as a d-dimensional array $A_R$
o The j-th dimension is indexed by the values of attribute $X_j$
o Cells contain the count of tuples in R having the corresponding combination of attribute values
o $A_R$: joint frequency distribution of all attributes of R
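As a concrete illustration of the joint frequency array, the sketch below builds it for a two-attribute relation. The function name `joint_frequency_array` and the sample tuples are hypothetical, and attribute values are assumed to be integers in the range 0 to size - 1.

```python
from collections import Counter


def joint_frequency_array(rows, size1, size2):
    """Dense 2-D array whose cell [x1][x2] counts the tuples equal to (x1, x2)."""
    counts = Counter(rows)  # missing combinations count as 0
    return [[counts[(x1, x2)] for x2 in range(size2)] for x1 in range(size1)]


# Hypothetical relation with two attributes, each over the domain 0..3.
relation = [(0, 1), (0, 1), (2, 3), (1, 0)]
a_r = joint_frequency_array(relation, 4, 4)
```

Real relations are typically sparse, so a dictionary keyed by cell (as `Counter` already provides) may be a better fit than a dense array when the domains are large.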
Chunk-based Organization of Relational Tables
o The joint frequency array $A_R$ is split into d-dimensional chunks
o Tuples of R belonging to the same chunk are stored contiguously on disk
o If R is not chunked, one extra pre-processing step is needed to reorganize R on disk
ComputeWavelet Algorithm
o When a chunk is loaded for the first time, ComputeWavelet can perform the entire computation for decomposing it
o Pairwise averaging and differencing is performed as soon as $2^d$ averages are accumulated
o Memory efficient: no more than one active sub-array at a time for each level of resolution
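The "combine as soon as enough averages have accumulated" idea can be illustrated with a simplified 1-D analogue (d = 1, so two averages trigger a step). This is not the paper's ComputeWavelet algorithm: chunking and disk I/O are omitted, and the name `streaming_haar` is hypothetical. The point of the sketch is that at most one pending average is buffered per resolution level.

```python
def streaming_haar(values):
    """One-pass 1-D Haar transform that buffers one average per level.

    Levels are numbered from the finest (0) upward in this sketch, which is
    the reverse of the slide's convention where l = 0 is the coarsest level.
    Returns (overall average, list of (level, index, detail) triples).
    """
    pending = {}  # level -> average waiting for its sibling
    emitted = {}  # level -> number of detail coefficients produced so far
    details = []

    def push(level, avg):
        if level not in pending:
            pending[level] = avg
            return
        left = pending.pop(level)
        index = emitted.get(level, 0)
        emitted[level] = index + 1
        details.append((level, index, (left - avg) / 2))
        push(level + 1, (left + avg) / 2)  # the new average moves up one level

    count = 0
    for value in values:
        push(0, float(value))
        count += 1
    # Exactly one buffered average remains when the length is a power of two.
    if count == 0 or len(pending) != 1:
        raise ValueError("input length must be a power of two")
    (overall,) = pending.values()
    return overall, details


overall, details = streaming_haar(iter([2, 2, 5, 7]))
```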
Processing Relational Queries in Wavelet Coefficient Domain
[Figure: two equivalent paths from the wavelet-coefficient synopses WT1, WT2, …, WTk to the approximate result relation S.
 Wavelet domain: Op(WT1, …, WTk) yields the RS of wavelet coefficients WS; Render(WS) yields S.
 Relational domain: Render(WT1, …, WTk) yields the approximate relations T1, T2, …, Tk; Op(T1, T2, …, Tk) yields S.]
Selection Operator
Our selection operator has the general form $\mathrm{select}_{pred}(W_T)$, where $pred$ represents a generic conjunctive predicate on a subset of the d attributes in T; that is, $pred = (l_{i_1} \le X_{i_1} \le h_{i_1}) \wedge \ldots \wedge (l_{i_k} \le X_{i_k} \le h_{i_k})$, where $l_{i_j}$ and $h_{i_j}$ denote the low and high boundaries of the selected range along each selection dimension $D_{i_j}$, $j = 1, 2, \ldots, k$, $k \le d$.
Selection - Relational Domain
o In the relational domain, we are interested only in those cells inside the query range
o In the wavelet domain, we are interested only in the coefficients that contribute to those cells
Relation:
Dim D1 (Attr1)   Dim D2 (Attr2)   Count
0                6                6
1                2                3
1                3                4
1                5                6
1                6                8
2                6                7
3                0                1
4                2                3
5                2                2
6                1                3
6                2                2
6                5                1
6                6                3
[Figure: joint data distribution array of the relation over Dim. D1 and Dim. D2, with the query range marked]
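The wavelet-domain view of selection, keeping only the coefficients that contribute to cells inside the query range, amounts to a rectangle-overlap test on each coefficient's support region. The sketch below shows just that filtering step. The names `overlaps` and `select_coefficients`, the (support, value) pair layout, and the sample coefficients and query range are all hypothetical. A full selection operator would presumably also need to restrict each retained coefficient to the query range, which is omitted here.

```python
def overlaps(support, query):
    """True if two hyper-rectangles, given as per-dimension (lo, hi) pairs, intersect."""
    return all(s_lo <= q_hi and q_lo <= s_hi
               for (s_lo, s_hi), (q_lo, q_hi) in zip(support, query))


def select_coefficients(coeffs, query):
    """Keep the (support, value) pairs whose support touches the query range."""
    return [(support, value) for support, value in coeffs
            if overlaps(support, query)]


# Hypothetical 2-D coefficients over a 4x4 array, bounds inclusive.
coeffs = [
    ([(0, 3), (0, 3)], 4.0),
    ([(0, 1), (0, 1)], 1.5),
    ([(2, 3), (0, 1)], -0.5),
    ([(2, 3), (2, 3)], 2.0),
]
query = [(2, 3), (0, 3)]  # 2 <= D1 <= 3, D2 unrestricted
kept = select_coefficients(coeffs, query)
```

Coefficients whose support lies entirely outside the range, such as the second one here, cannot affect any selected cell and are dropped without rendering the array.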
Projection Operator
Projection- Wavelet Domain
Join Operator
Join Operator- Wavelet Domain
Experimental Study
o Improved answer quality
o Low synopsis construction costs
o Fast query execution
Query Execution Times
SELECT-JOIN-SUM
SELECT Query errors on real-life data
Conclusion
o Multidimensional wavelets as an effective tool for general-purpose approximate query processing in modern, high-dimensional applications
o The query processing algorithms operate directly on the wavelet-coefficient synopses of relational data, thus allowing for very fast processing of arbitrarily complex queries entirely in the wavelet-coefficient domain
o Extensive experimental study with synthetic as well as real-life data sets that verifies the effectiveness of the wavelet-based approach compared to both sampling and histograms