1
Enhancing Bitmap Indices
Guadalupe Canahuate
The Ohio State University
Advisor: Hakan Ferhatosmanoglu
2
Introduction
• Bitmap indices:
  ◦ Widely used in data warehouses and read-only domains
  ◦ Implemented in commercial DBMS such as Oracle, Informix and Sybase
  ◦ Extremely efficient for execution of point and range queries
3
Applications
• Data warehouses
• Scientific data
  ◦ High-energy physics: simulations are run continuously, and notable events are stored with all their details.
  ◦ Climate modeling: sensor data.
  ◦ Astrophysics: telescopes devoted to observations.
• Visualization applications
[Image: Supernova Observatory]
4
Bitmap Indices
• Data is quantized into b categories using b bits
• Each tuple is encoded based on which category its attribute belongs to:
  ◦ Bitmap Equality Encoding (BEE) (Projection Index, Binning): suited for point queries
  ◦ Bitmap Range Encoding (BRE): suited for range queries
• Fast bitwise operations over bit vectors (AND, OR, NOT, XOR)
Value   EE: 1  2  3   RE: 1  2  3
1           1  0  0       1  1  1
3           0  0  1       0  0  1
2           0  1  0       0  1  1
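The two encodings can be illustrated with a small Python sketch over three tuples whose values are 1, 3 and 2. The function names and the list-of-bits representation are assumptions made for illustration; a real index would pack the bits into machine words.

```python
def equality_encode(values, cardinality):
    # One bitmap per category v: bit is 1 when the tuple's value equals v.
    return {v: [int(x == v) for x in values] for v in range(1, cardinality + 1)}

def range_encode(values, cardinality):
    # One bitmap per category v: bit is 1 when the tuple's value is <= v.
    return {v: [int(x <= v) for x in values] for v in range(1, cardinality + 1)}

def bit_or(a, b):
    return [x | y for x, y in zip(a, b)]

values = [1, 3, 2]
ee = equality_encode(values, 3)
rng = range_encode(values, 3)

# Point query (value = 2): a single equality-encoded bitmap answers it.
point_result = ee[2]

# Range query (value <= 2): equality encoding needs an OR of two bitmaps,
# range encoding reads one bitmap directly.
range_from_ee = bit_or(ee[1], ee[2])
range_from_re = rng[2]
```

Both range results mark the first and third tuples, which shows why range encoding suits range queries: the answer is a single bitmap lookup.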
5
Bitmap Compression
• Run-length encoders
  ◦ Runs of 0s → (0, run-count)
  ◦ Byte-aligned Bitmap Code (BBC) [Oracle ‘94]
  ◦ Word-Aligned Hybrid Code (WAH) [LBNL ‘02]
  ◦ No need to decompress for query processing
  ◦ Full scan of the bitmap is needed
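As a rough illustration of operating on run-length encoded bitmaps without decompressing them, here is a Python sketch using plain (bit, run-count) pairs. This is a simplified stand-in, not BBC or WAH, which align runs to bytes or words; the function names are hypothetical.

```python
from itertools import groupby

def rle_encode(bits):
    # Collapse each run of identical bits into a (bit, run_count) pair.
    return [(bit, len(list(run))) for bit, run in groupby(bits)]

def rle_and(xs, ys):
    # AND two run lists of equal total length, walking both lists run by run.
    out = []
    x_iter, y_iter = iter(xs), iter(ys)
    x_bit, x_len = next(x_iter, (0, 0))
    y_bit, y_len = next(y_iter, (0, 0))
    while x_len and y_len:
        step = min(x_len, y_len)
        bit = x_bit & y_bit
        if out and out[-1][0] == bit:
            out[-1] = (bit, out[-1][1] + step)   # extend the current run
        else:
            out.append((bit, step))
        x_len -= step
        y_len -= step
        if x_len == 0:
            x_bit, x_len = next(x_iter, (0, 0))
        if y_len == 0:
            y_bit, y_len = next(y_iter, (0, 0))
    return out

a = [0, 0, 0, 1, 1, 0]
b = [0, 1, 1, 1, 0, 0]
anded = rle_and(rle_encode(a), rle_encode(b))
```

Note that the AND still visits every run from the start of both bitmaps, which is the full-scan cost the slide mentions.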
6
Our Goal
• Enhance bitmap indices:
  ◦ Reduce index size (ICDE’05, TKDE submission)
    • Improve compression
  ◦ Direct access over compressed bitmaps (VLDB’06)
    • Bitmap hashing
  ◦ Handle missing data (EDBT’06)
    • Query support / semantics
  ◦ Handle data updates (SSDBM’07)
    • Appends of new data
8
Improving Compression by Data Reorganization
• Finding the optimal tuple ordering is NP-complete
• Most TSP heuristics are ineffective
• Goal: minimize the Hamming distance between adjacent tuples
• Gray code: a space-filling curve for Hamming space
• A scalable, in-place algorithm to Gray-sort the tuples
Submitted to TKDE
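To make the Hamming-distance argument concrete, the following Python sketch compares lexicographic order with Gray code order on all eight 3-bit tuples. The helper names are hypothetical, and sorting by a computed rank is used here only for illustration; it is not the in-place algorithm the slide refers to.

```python
def gray_rank(bits):
    # Position of a bit tuple in reflected Gray code order:
    # the running XOR of the bits converts a Gray codeword to its binary rank.
    rank = 0
    acc = 0
    for bit in bits:
        acc ^= bit
        rank = (rank << 1) | acc
    return rank

def total_hamming(rows):
    # Sum of Hamming distances between each pair of adjacent rows.
    return sum(sum(x != y for x, y in zip(r, s)) for r, s in zip(rows, rows[1:]))

rows = [(1, 0, 1), (0, 0, 0), (1, 1, 0), (0, 1, 1),
        (0, 0, 1), (1, 1, 1), (0, 1, 0), (1, 0, 0)]

lexicographic = sorted(rows)
gray_ordered = sorted(rows, key=gray_rank)
```

On this input the Gray order changes exactly one bit between neighbors (total distance 7), while lexicographic order totals 11. Fewer bit changes between adjacent tuples means longer runs in each bitmap column, which is what run-length compression rewards.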
9
Reordering Framework
[Figure: Input Bitmaps → Column Ordering → Row Ordering → Run Length Encoding → Compressed Bitmaps]
Reorder (A, start, end, b)
  i ← start
  j ← end
  while i < j do
    Decrement j until S(j, Cols[b]) = 0
    Increment i until S(i, Cols[b]) = 1
    if i < j then
      Swap the ith and jth tuples on the table
    end if
  end while
  if b < no_of_bits then
    Cols[b+1] = nextCol()
    Reorder (A, start, j, b+1)
    Reorder (A, j+1, end, b+1)
    Reverse (j+1, end)
  end if
[Figure annotations: EE and RE Bitmaps; Col Ordering Criteria; Gray Code Ordering; Word-Aligned Hybrid (WAH); Lossless compression]
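The Reorder procedure on this slide can be sketched in Python roughly as follows. This is an interpretation, not the authors' implementation: the column order is fixed up front instead of being chosen by nextCol(), rows are lists of 0/1, and the reversal uses a slice copy for brevity rather than a strictly in-place swap loop.

```python
def gray_sort(rows, cols=None):
    # Reorder rows (lists of 0/1) into reflected Gray code order.
    # cols is the column order to partition on; assumed fixed here.
    if cols is None:
        cols = list(range(len(rows[0]))) if rows else []

    def reorder(start, end, b):
        if start >= end or b >= len(cols):
            return
        c = cols[b]
        i, j = start, end
        while True:
            while i <= end and rows[i][c] == 0:    # first row with a 1
                i += 1
            while j >= start and rows[j][c] == 1:  # last row with a 0
                j -= 1
            if i >= j:
                break
            rows[i], rows[j] = rows[j], rows[i]
        # rows[start..j] now hold 0 in column c, rows[j+1..end] hold 1.
        reorder(start, j, b + 1)
        reorder(j + 1, end, b + 1)
        # Reflect the second half so the boundary rows differ in one bit only.
        rows[j + 1:end + 1] = rows[j + 1:end + 1][::-1]

    reorder(0, len(rows) - 1, 0)
    return rows

table = [[1, 0, 1], [0, 0, 0], [1, 1, 0], [0, 1, 1],
         [0, 0, 1], [1, 1, 1], [0, 1, 0], [1, 0, 0]]
gray_sort(table)
```

Each level partitions the current segment on one bit column, recurses into both halves, and then reverses the second half, which is what produces the reflected order.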
10
Column Ordering Criteria
• Using Bitmap Data
  ◦ Number of Set Bits
    • Static: Most 1s Asc/Desc
    • Dynamic: Most 1s in the Longest Run
  ◦ Compressibility
    • Static: $C_i = \frac{n}{2} - s_i$
    • Dynamic: Weighted Sum of Compressibility over the Gray code segments
• Using Query Workloads
  ◦ Most frequently accessed columns first
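As one possible reading of the static "Most 1s" criterion, the sketch below ranks bitmap columns by their number of set bits. The function name and the column-as-list representation are assumptions for illustration only.

```python
def order_columns_by_set_bits(bitmaps, descending=True):
    # bitmaps: list of columns, each a list of 0/1 bits.
    # Returns column indices ordered by their count of set bits.
    counts = [sum(column) for column in bitmaps]
    return sorted(range(len(bitmaps)), key=lambda c: counts[c], reverse=descending)

bitmaps = [
    [1, 0, 0, 0],  # column 0: one set bit
    [1, 1, 1, 0],  # column 1: three set bits
    [1, 1, 0, 0],  # column 2: two set bits
]
most_ones_first = order_columns_by_set_bits(bitmaps)
fewest_ones_first = order_columns_by_set_bits(bitmaps, descending=False)
```

The resulting index list could then drive the order in which columns are partitioned during row reordering.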
11
Experimental Results
• 2-10x improvement in compression over already compressed bitmaps
• 4-7x improvement in query execution time
[Figure: plots of bitmap size (32-bit words) by column; legend: WAH, Gray Code, Gray Code + Col Order]
13
Compression and Direct Access
• Bitmaps need to be compressed:
  ◦ Row numbers no longer correspond to the bit position in the bitmap
• Cannot pinpoint a row
  ◦ Need to scan the bitmap
• Enable direct access over the compressed bitmaps
• Trade off accuracy vs. efficiency
  ◦ No false negatives
VLDB 2006
14
Direct Access Over Compressed Bitmaps
• Hashing/Bloom Filters
  ◦ A 2^m-bit array indexed using k independent hash functions
  ◦ False positives can happen, but false negatives cannot
  ◦ Only the set bits are inserted into the Approximate Bitmap (AB)
  ◦ Three levels of encoding:
    • Per table, per attribute, per bitmap column
  ◦ Precision estimation
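The Bloom-filter idea can be sketched in Python as follows. The class name, the (row, column) key format, and the use of seeded SHA-256 as the k hash functions are all assumptions for illustration; the slides do not specify the hash functions used.

```python
import hashlib

class ApproximateBitmap:
    # A 2^m-bit array probed by k hash functions (hypothetical sketch).

    def __init__(self, m, k):
        self.size = 1 << m
        self.k = k
        self.bits = bytearray(self.size)  # one byte per bit, for clarity

    def _positions(self, row, col):
        # Derive k array positions from the (row, col) key using seeded hashes.
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{row}:{col}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def insert(self, row, col):
        # Called only for bits that are set in the original bitmap.
        for pos in self._positions(row, col):
            self.bits[pos] = 1

    def test(self, row, col):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(row, col))

set_bits = [(0, 2), (1, 0), (5, 1), (9, 2)]  # (row, bitmap column) pairs
ab = ApproximateBitmap(m=10, k=3)
for row, col in set_bits:
    ab.insert(row, col)
```

A lookup for a given row touches only k array positions, so a single row can be checked without scanning the bitmap from the beginning.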
15
Direct Access Over Compressed Bitmaps
• Execution time depends on the number of rows queried, not on the size of the bitmaps
• For queries over fewer than ~15% of the rows, execution is up to 3 orders of magnitude faster than WAH
• The array size is always set to be smaller than the WAH bitmap
[Figure: execution time (msec) vs. number of rows queried (0 to 10,000); series: WAH and AB on the Uniform, Landsat, and HEP data sets]
17
Handling Missing Data
• Incomplete databases are common in all research and industry domains
• Index performance degrades in the presence of missing data
• Missing data map to a distinguished category
• Query semantics:
  ◦ Missing is a match
    • In a patient database, retrieve the records whose first name is “John” and middle name is “Paul”.
  ◦ Missing is not a match
    • In a census database, retrieve the interviewees who answered “C” in question 4 and “A” in question 8.
EDBT 2006
18
Handling Missing Data
Equality Encoded
Record  Value    B1,0  B1,1  B1,2  B1,3  B1,4  B1,5
1       5        0     0     0     0     0     1
2       2        0     0     1     0     0     0
3       3        0     0     0     1     0     0
4       missing  1     0     0     0     0     0
5       4        0     0     0     0     1     0
6       5        0     0     0     0     0     1
7       1        0     1     0     0     0     0
8       3        0     0     0     1     0     0
9       missing  1     0     0     0     0     0
10      2        0     0     1     0     0     0

Range Encoded
Record  Value    B1,0  B1,1  B1,2  B1,3  B1,4  B1,5
1       5        0     0     0     0     0     1
2       2        0     0     1     1     1     1
3       3        0     0     0     1     1     1
4       missing  1     1     1     1     1     1
5       4        0     0     0     0     1     1
6       5        0     0     0     0     0     1
7       1        0     1     1     1     1     1
8       3        0     0     0     1     1     1
9       missing  1     1     1     1     1     1
10      2        0     0     1     1     1     1
19
Query Processing with Missing Data
[Table: bitmap expressions for the query v1 ≤ Ai ≤ v2; columns: Equality Encoded, Range Encoded; rows: Missing is a Match, Missing is not a Match]
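One way the two semantics could be evaluated is sketched below in Python, using the ten-record example from the previous slide (values 5, 2, 3, missing, 4, 5, 1, 3, missing, 2) with bitmap 0 reserved for missing values. The expressions used here (OR over the value bitmaps for equality encoding, XOR of two bitmaps for range encoding, then an optional OR with bitmap 0) are assumptions that follow from how the example bitmaps are built; they are not copied from the slide. The sketch assumes 1 ≤ v1 ≤ v2 ≤ cardinality.

```python
values = [5, 2, 3, None, 4, 5, 1, 3, None, 2]  # records 1..10, None = missing
CARDINALITY = 5

def code(v):
    # Missing values map to the distinguished category 0.
    return 0 if v is None else v

def equality_bitmaps(values, cardinality):
    return [[int(code(v) == j) for v in values] for j in range(cardinality + 1)]

def range_bitmaps(values, cardinality):
    # Bit is 1 when value <= j; missing (category 0) sets every bitmap.
    return [[int(code(v) <= j) for v in values] for j in range(cardinality + 1)]

def bit_or(a, b):
    return [x | y for x, y in zip(a, b)]

def bit_xor(a, b):
    return [x ^ y for x, y in zip(a, b)]

def query_equality(B, v1, v2, missing_is_match):
    result = [0] * len(B[0])
    for j in range(v1, v2 + 1):
        result = bit_or(result, B[j])
    return bit_or(result, B[0]) if missing_is_match else result

def query_range(B, v1, v2, missing_is_match):
    # XOR cancels records with value < v1, and also the missing records,
    # because a missing record has a 1 in both bitmaps.
    result = bit_xor(B[v2], B[v1 - 1])
    return bit_or(result, B[0]) if missing_is_match else result

def records(bits):
    return [r + 1 for r, bit in enumerate(bits) if bit]

EE = equality_bitmaps(values, CARDINALITY)
RE = range_bitmaps(values, CARDINALITY)
strict = records(query_range(RE, 2, 3, missing_is_match=False))
lenient = records(query_range(RE, 2, 3, missing_is_match=True))
```

For the query 2 ≤ A ≤ 3, the strict semantics returns records 2, 3, 8 and 10, and the lenient semantics additionally returns the two records with missing values, 4 and 9. Both encodings give the same answers.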
21
Handling Data Updates
• Look for this project in the poster session… :)
SSDBM 2007
22
Conclusion
• Enhance bitmap indices:
  ◦ Improve performance
    • Better compression
    • Direct access over compressed bitmaps
  ◦ Overcome limitations
    • Handle updates efficiently
    • Support more types of queries
• Extend the applicability of bitmaps to more domains that can benefit from their good performance
23
Ongoing Work
• Support new types of queries
  ◦ Similarity queries
• Integrate bitmaps into the coastal monitoring system (NSF Cyber Infrastructure Project)
• Bitmaps on emerging hardware technologies