March, 2002
Efficient Bitmap Indexing Techniques for Very Large Datasets
Kesheng John Wu, Ekow Otoo, Arie Shoshani
Problem Statement
• Main objective: map logical requests to qualified objects
— A logical request:
    • 20001015<=eventTime & 200<energy<300 …
— Objects:
    • Set of object ids
    • Set of files containing the objects
    • Offsets within the files, …
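As a point of reference, the logical request above can be answered without any index by scanning every object. The sketch below is only a baseline illustration: the `events` table, its values, and the `scan` helper are hypothetical and do not come from the STAR data.

```python
# Hypothetical in-memory table: one list per attribute, indexed by object id.
events = {
    "eventTime": [20001014, 20001015, 20001101, 20001102],
    "energy":    [250.0,    199.0,    275.5,    310.0],
}

def scan(table):
    """Return ids of objects with 20001015 <= eventTime and 200 < energy < 300."""
    n = len(table["eventTime"])
    return [
        oid for oid in range(n)
        if 20001015 <= table["eventTime"][oid]
        and 200 < table["energy"][oid] < 300
    ]

qualified = scan(events)  # set of object ids that satisfy the request
```

A full scan touches every value of every attribute in the request, which is the cost an index is meant to avoid.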
Application: STAR

OID  dst     hist    mEventNumber  mEventTime       mRunNumber  NLb
0    159625  159627  2635          20000827.011759  1239029     1341
1    159625  159627  2636          20000827.011759  1239029     1470
2    159625  159627  2637          20000827.011759  1239029     1663

OID  n_clus_tpc_in[13]  numberOfPrimaryTracks  ChargedParticles_Means[1]  PrimaryVertexX  qxb[2]  zdc2Energy
0    909                1228                   266                        .56             -26.40  48
1    1243               1415                   317                        .46             -29.08  53
2    1285               1533                   281                        .53             -6.754  8

A portion of the STAR tag dataset: 3 events with 12 attributes from millions of events with 502 attributes.
Application: Combustion
• Direct numerical simulation of auto-ignition process (solution of complex partial differential equations)
• A dozen or more variables are computed at each time step and each grid point
• Number of grid points: 2D 600 X 600 >>> 3D 1000 X 1000 X 1000
• Time steps: 100 >>> 1000s
• Data size: 1 GB >>> 10 TB
• Task: identify features and track them across time steps
• E.g. find the flame front across time: find “600<temp<700” for 1 billion points per time step, and discover overlap between time steps
• Use compressed bitmaps to accelerate both feature extraction and feature tracking
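A minimal sketch of the feature-tracking idea: one bitmap per time step marks the grid points with 600<temp<700, and a bitwise AND gives the overlap between steps. The temperatures and the `range_bitmap` helper are hypothetical, and the bitmaps are built here by a direct scan for brevity; in the approach described they would come from a compressed bitmap index.

```python
def range_bitmap(values, lo, hi):
    """Bitmap (a Python int) with bit i set iff lo < values[i] < hi."""
    bits = 0
    for i, v in enumerate(values):
        if lo < v < hi:
            bits |= 1 << i
    return bits

# Hypothetical temperatures at 8 grid points for two consecutive time steps.
temp_t0 = [550, 610, 650, 690, 720, 580, 640, 700]
temp_t1 = [560, 590, 660, 695, 680, 610, 720, 650]

front_t0 = range_bitmap(temp_t0, 600, 700)   # feature extraction, step 0
front_t1 = range_bitmap(temp_t1, 600, 700)   # feature extraction, step 1

overlap = front_t0 & front_t1                # feature tracking: one bitwise AND
overlap_points = [i for i in range(len(temp_t0)) if (overlap >> i) & 1]
```

Once the per-step bitmaps exist, tracking needs no further access to the temperature data.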
Building a Bitmap Index
1. Partition each property into bins (binning)
— e.g. for 0<NLb<4000, 20 equal size bins: [0, 200) [200, 400) …
2. Generate a bit vector for each bin (encoding)
— Bit i of bit vector j is 1 iff NLb[i] is in bin j
3. Compress each bit vector
[Figure: stacks of bit vectors, one vector per bin, shown for property 1, property 2, …, property n]
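Steps 1 and 2 can be sketched as follows, using the three NLb values from the STAR sample and the 20 equal-size bins over [0, 4000). The function name `build_bitmap_index` is hypothetical, Python integers stand in for bit vectors, and step 3 (compression) is left out.

```python
def build_bitmap_index(values, lo, hi, n_bins):
    """Equality-encoded index: bitmaps[j] has bit i set iff values[i] is in bin j.

    Assumes lo <= v <= hi for every value (an assumption of this sketch).
    """
    width = (hi - lo) / n_bins
    bitmaps = [0] * n_bins
    for i, v in enumerate(values):
        j = min(int((v - lo) // width), n_bins - 1)  # clamp v == hi into last bin
        bitmaps[j] |= 1 << i
    return bitmaps

nlb = [1341, 1470, 1663]                     # NLb of objects 0, 1, 2
index = build_bitmap_index(nlb, 0, 4000, 20)

# Object 0 falls in bin 6 = [1200, 1400), object 1 in bin 7, object 2 in bin 8.
```

Every object sets exactly one bit across the 20 bit vectors, so most vectors are mostly zeros, which is what makes them compress well.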
Advantages of Bitmap Index
• Bitmap index: specialized index that takes advantage of
— Read-mostly data: data produced from scientific experiments can be appended in large groups
• Fast operations
— “Predicate queries” can be performed with bitwise logical operations
    • Predicate ops: =, <, >, <=, >=, range
    • Logical ops: AND, OR, XOR, NOT
— They are well supported by hardware
• Easy to compress, potentially small index size
• Each individual bitmap is small and frequently used ones can be cached in memory
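To illustrate how a predicate query reduces to bitwise logical operations, the sketch below evaluates a two-attribute condition over six objects. The bin layouts and bitmap contents are hypothetical, and uncompressed Python integers stand in for the bit vectors.

```python
N = 6  # number of objects; bit i of every bitmap refers to object i

# Hypothetical equality-encoded bitmaps.
# energy bins: [0,100) [100,200) [200,300) [300,400)
energy_bins = [0b000001, 0b010010, 0b101000, 0b000100]
# eventTime bins: before 20001015, on or after 20001015
time_bins = [0b000011, 0b111100]

def or_all(bitmaps):
    """OR a sequence of bitmaps together."""
    result = 0
    for b in bitmaps:
        result |= b
    return result

# 100 <= energy < 300: OR the two bins covering the range.
energy_hit = or_all(energy_bins[1:3])
# ... AND 20001015 <= eventTime.
hits = energy_hit & time_bins[1]
object_ids = [i for i in range(N) if (hits >> i) & 1]

# NOT needs a mask so that only the N valid bits are kept.
misses = ~hits & ((1 << N) - 1)
```

Each `|` and `&` here processes all objects at once, which is why these operations map well onto word-wide hardware instructions.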
Operation-efficient Compression Methods
• Best known: byte-aligned bitmap code (BBC)
— Uses run-length encoding (next slide)
— Byte alignment, optimized for space efficiency
— Decoding on bit level, not optimal for operations
— Used in Oracle
• We developed a new word-aligned scheme: WAH
— Uses run-length encoding
— Word alignment
— Designed for minimal decoding to gain speed
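The following is a simplified sketch of a word-aligned run-length code in the spirit of WAH, not the authors' implementation. It assumes the layout commonly described for WAH, which the slides do not spell out: 32-bit words, where a word with the top bit clear holds 31 literal bits, and a word with the top bit set is a fill whose next bit gives the fill value and whose low 30 bits count 31-bit groups. The names `wah_encode` and `wah_decode` are hypothetical, and the tail is zero-padded to a whole group for brevity.

```python
GROUP = 31                       # payload bits per 32-bit word
FILL_FLAG = 1 << 31              # top bit set: this word is a fill
FILL_BIT = 1 << 30               # next bit: the value being repeated
ALL_ONES = (1 << GROUP) - 1
MAX_COUNT = (1 << 30) - 1        # largest group count a fill word can hold

def wah_encode(bits):
    """Encode a string of '0'/'1' characters into a list of 32-bit words."""
    bits = bits + "0" * (-len(bits) % GROUP)       # assumption: zero-pad the tail
    words = []
    for start in range(0, len(bits), GROUP):
        group = int(bits[start:start + GROUP], 2)
        if group in (0, ALL_ONES):
            fill = FILL_FLAG | (FILL_BIT if group else 0)
            last = words[-1] if words else 0
            same_fill = (last & ~MAX_COUNT) == fill
            if same_fill and (last & MAX_COUNT) < MAX_COUNT:
                words[-1] += 1                     # extend the previous fill word
            else:
                words.append(fill | 1)             # start a new fill of one group
        else:
            words.append(group)                    # literal word, top bit is 0
    return words

def wah_decode(words):
    """Inverse of wah_encode (including any zero padding)."""
    out = []
    for w in words:
        if w & FILL_FLAG:
            bit = "1" if w & FILL_BIT else "0"
            out.append(bit * (GROUP * (w & MAX_COUNT)))
        else:
            out.append(format(w, "031b"))
    return "".join(out)

bits = "0" * 62 + "1111" + "0" * 27 + "1" * 31
words = wah_encode(bits)   # fill of two zero groups, one literal, fill of one ones group
```

Because every word is either a whole literal group or a whole number of groups, a logical operation can work word by word without unpacking individual bits, which is the property the slide refers to as minimal decoding.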
Operation-efficient Compression Methods
Uncompressed: 0000000000001111000000000 ...... 0000001000000001111111100000000 .... 000000
Compressed: 12, 4, 1000, 1, 8, 1000
• Based on variations of run-length compression
• Store very short sequences as-is
• Advantage: can perform AND, OR, COUNT operations on compressed data
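To show what operating on compressed data can look like, here is a minimal sketch that ANDs and counts two run-length encoded bitmaps without expanding them. The representation, a list of `(bit, run_length)` pairs, and the names `rle_and` and `rle_count` are assumptions made for illustration; it is not the BBC or WAH format. Both inputs are assumed to cover the same number of bits with positive run lengths.

```python
def rle_and(a, b):
    """AND two run-length encoded bitmaps given as lists of (bit, run_length)."""
    out = []
    i = j = 0
    rem_a = a[0][1] if a else 0          # bits left in the current run of a
    rem_b = b[0][1] if b else 0          # bits left in the current run of b
    while i < len(a) and j < len(b):
        step = min(rem_a, rem_b)         # overlap of the two current runs
        bit = a[i][0] & b[j][0]
        if out and out[-1][0] == bit:
            out[-1] = (bit, out[-1][1] + step)   # merge with the previous run
        else:
            out.append((bit, step))
        rem_a -= step
        rem_b -= step
        if rem_a == 0:
            i += 1
            rem_a = a[i][1] if i < len(a) else 0
        if rem_b == 0:
            j += 1
            rem_b = b[j][1] if j < len(b) else 0
    return out

def rle_count(runs):
    """Number of 1 bits, computed from the run lengths alone."""
    return sum(length for bit, length in runs if bit)

a = [(0, 12), (1, 4), (0, 1000), (1, 1), (0, 8)]   # 1025 bits
b = [(0, 14), (1, 1000), (0, 11)]                  # 1025 bits
both = rle_and(a, b)
```

The work done is proportional to the number of runs rather than the number of bits, so long runs of zeros cost almost nothing.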
Trade-off of Compression Schemes
[Figure: space vs. speed trade-off for uncompressed, WAH, BBC, gzip, ExpGol, and PacBits bitmaps, with an arrow marking the “better” direction]
Information About the Test Machines
• Hardware and system
— Sun Enterprise 450 (UltraSPARC II, 400 MHz)
— 4 GB RAM
— VERITAS Volume Manager (striped disk)
• Real application data from STAR
— More than 2 million objects, 12 attributes
• Synthetic data
— 100 million objects, 10 attributes
• Terms
— Compression ratio: ratio of compressed bitmap size to uncompressed bitmap size
— Times reported are wall-clock times in seconds
Logical Operation Time (Synthetic Data): 10X improvement
Logical Operation Time (STAR Data): also 10X improvement
Encoding Schemes – Main Idea
[Figure: equality encoding, range encoding, and interval encoding illustrated over 12 bins, numbered 1 to 12]
Interval and range encoding: a query operates on only 2 bitmaps!
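The difference in operation counts can be sketched for two of the three schemes. Under equality encoding a query over bins lo..hi ORs one bitmap per bin, while under range encoding, where bitmap j marks every object in a bin at or below j, the same query combines at most two bitmaps. Function names and the sample bin assignments are hypothetical; interval encoding is not shown, and this sketch keeps the last (all ones) range bitmap for simplicity.

```python
def equality_encode(bin_ids, n_bins):
    """E[j] has bit i set iff object i falls in bin j."""
    E = [0] * n_bins
    for i, b in enumerate(bin_ids):
        E[b] |= 1 << i
    return E

def range_encode(bin_ids, n_bins):
    """R[j] has bit i set iff object i falls in a bin <= j."""
    R, acc = [], 0
    for e in equality_encode(bin_ids, n_bins):
        acc |= e
        R.append(acc)
    return R

def query_equality(E, lo, hi):
    """Objects in bins lo..hi: reads hi - lo + 1 bitmaps."""
    result = 0
    for j in range(lo, hi + 1):
        result |= E[j]
    return result

def query_range(R, lo, hi):
    """Objects in bins lo..hi: reads at most 2 bitmaps, one logical operation."""
    if lo == 0:
        return R[hi]
    # R[lo - 1] is a subset of R[hi], so XOR removes exactly the lower bins.
    return R[hi] ^ R[lo - 1]

bin_ids = [0, 3, 11, 5, 3, 7, 9, 4]      # hypothetical bin of each of 8 objects
E = equality_encode(bin_ids, 12)
R = range_encode(bin_ids, 12)
```

For bins 3..7 the equality query touches five bitmaps and the range query two, with identical results. The price is that range-encoded bitmaps are denser and therefore compress less well.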
Total Effect of Compression and Encoding Schemes
• Bottom line on queries
— Compression scheme determines efficiency of logical operations
— Encoding scheme determines number of operations
    • Range & interval – only one logical operation over 2 bitmaps
    • Equality – many operations depending on number of bins
— But, space may be a consideration
• What is the trade-off?
Interval Encoding Is Better Overall (WAH Compression)
Points on the graphs represent 10, 20, 30, 50, and 100 bins.
Average time for random range queries
Timing Results
Method                      Index size (X data)  Time (sec)  Speed
ORACLE
  Scan                      0                    6           0.1
  B-tree                    3.6                  0.95        0.6
Native vertical partition
  Scan                      0                    0.57        1
  20 bins                   0.18                 0.11        5
  50 bins                   0.43                 0.07        8
  100 bins                  0.90                 0.05        11
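The Speed column appears to be consistent with a speedup relative to the native vertical-partition scan, that is, 0.57 s divided by each method's time. This reading is an inference from the numbers rather than something stated in the table; the short check below reproduces the column under that assumption.

```python
# (method, time in seconds) as listed in the timing table
times = [
    ("ORACLE scan", 6),
    ("ORACLE B-tree", 0.95),
    ("native scan", 0.57),
    ("20 bins", 0.11),
    ("50 bins", 0.07),
    ("100 bins", 0.05),
]

baseline = 0.57  # assumption: the native vertical-partition scan is the reference
speed = {name: baseline / t for name, t in times}

for name, s in speed.items():
    print(f"{name:14s} {s:5.1f}")
```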
Summary
• Compressed bitmap indices are effective for range queries
• Better compression scheme
— 50% more space, but 12 times faster!!!
• Among the different encoding schemes
— The interval encoding is the overall winner
Future Work
• Support NULL value and categorical values
• On-line update: add new data and update index without interrupting request processing
• Recovery mechanism for robustness
• Potential new applications: climate, astrophysics, biology (microarrays)
• Study non-uniform binning strategies
• Study more encoding schemes
• Integrate with conventional database system: to better handle metadata, to provide more versatile front-end
How Many Bins for Continuous Domains?
[Figure: a 2D query region over Range(x) and Range(y); the bins cut by the query boundary are marked as edge bins]
• More bins: fewer objects in edge bins
• Searching edge bins: skip-scan over “attribute vertical partition”
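A one-dimensional sketch of the edge-bin idea: bins that lie entirely inside the query range are answered from the bitmaps alone, while objects in the bins cut by a query boundary are only candidates and must be checked against the raw attribute values. The names `build_index` and `range_query`, the bin edges, and the values are hypothetical; the sketch assumes `edges[0] <= lo < hi <= edges[-1]` and that all values lie in `[edges[0], edges[-1])`.

```python
import bisect

def build_index(values, edges):
    """bitmaps[j] has bit i set iff edges[j] <= values[i] < edges[j + 1]."""
    bitmaps = [0] * (len(edges) - 1)
    for i, v in enumerate(values):
        bitmaps[bisect.bisect_right(edges, v) - 1] |= 1 << i
    return bitmaps

def range_query(values, edges, bitmaps, lo, hi):
    """Return (ids with lo <= value < hi, number of raw values checked)."""
    first = bisect.bisect_right(edges, lo) - 1   # bin containing lo
    last = bisect.bisect_left(edges, hi) - 1     # last bin overlapping [lo, hi)
    sure = candidates = 0
    for j in range(first, last + 1):
        if lo <= edges[j] and edges[j + 1] <= hi:
            sure |= bitmaps[j]                   # bin fully inside the query
        else:
            candidates |= bitmaps[j]             # edge bin: needs a raw check
    hits, checked = sure, 0
    for i in range(len(values)):
        if (candidates >> i) & 1:                # read raw data for candidates only
            checked += 1
            if lo <= values[i] < hi:
                hits |= 1 << i
    ids = [i for i in range(len(values)) if (hits >> i) & 1]
    return ids, checked

edges = [0, 100, 200, 300, 400]
values = [50, 120, 180, 210, 290, 310, 399, 150]
bitmaps = build_index(values, edges)
ids, checked = range_query(values, edges, bitmaps, 150, 310)
```

With narrower bins, fewer objects land in the two edge bins, so fewer raw values are read; the cost is a larger number of bitmaps.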