CS 345: Topics in Data Warehousing Thursday, October 28, 2004

CS 345:Topics in Data Warehousing

Thursday, October 28, 2004

Review of Tuesday’s Class

• Bitmap compression with BBC codes– Gaps and Tails– Variable byte-length encoding of lengths– Special handling of lone bits

• Speeding up star joins– Cartesian product of dimensions– Semi-join reduction– Early aggregation

Outline of Today’s Class

• Join Indexes

• Projection Indexes– Horizontal vs. Vertical decomposition

• Bit-Sliced Indexes– Fast bitmap counts and sums– Range queries

• Bit Vector Filtering

*O’Neil and Quass, “Improved Query Performance with Variant Indexes”, 1997

Join Indexes

• Generally, indexes are built on the tuples of a single relation

• Join indexes include attributes from more than one relation– Or else index entries consist of attributes from one

relation and RIDs from another

• Example:– Standard index on Customer.State

• Index entry consists of <State, Customer RID> pair

– Join index on Customer.State & Sales fact• Index entry consists of <State, Sales RID> pair

Defining Join Indexes

• Join indexes are like indexes on a materialized view– Except that the view isn’t actually materialized…

• Oracle syntax:– CREATE BITMAP INDEX ON Sales(Customer.State)FROM Sales, CustomerWHERE Sales.Customer_key = Customer.Customer_key

• Creates an index with <State, Sales RID> entries • Join condition used in joining the tables is specified at

index creation time

Why Join Indexes?

• Consider a star query plan using semi-join reductions– For each dimension:

• Determine keys of dimension rows that satisfy filter conditions• Join list of dimension keys to fact table (using index) • Result = list of fact RIDs

– Merge lists of fact RIDs from all dimensions– Retrieve matching fact rows– Join back to grouping dimensions– Perform grouping and aggregation

• Join indexes allow the join step to be skipped– Join is precomputed when join index created– Fact RID list stored directly in join index

• Bitmap join indexes are particularly common– Other types are possible (B-Tree, etc.)– Bitmap join indexes facilitate RID list merging / index intersection

Limitations of Join Indexes• Index creation cost is high

– Need to compute join result to create index• Index maintenance cost is high

– Usually not an issue in data warehouses– Data warehouse is “read-mostly”– Drop index before warehouse load begins– Re-create after load completes

• Can’t access non-indexed columns of dimension table– SELECT Customer.Age, SUM(Quantity)

FROM Sales, CustomerWHERE Sales.Customer_key = Customer.Customer_keyAND Customer.State = 'CA'

– Using ordinary index on State:• Get dimension table RIDs using index• Look up dimension rows to learn Age for each• Join to fact table to learn Quantity

– Using join index on State:• Get fact RIDs using index• Look up fact rows to learn Quantity• Join back to dimension table to learn Age

Horizontal vs. Vertical Decomposition

• Relations have rows and columns– “Two-dimensional” structure

• Disk access model is one-dimensional stream– Disk spins fast, disk arm moves slowly– Sequential I/O = leave the disk arm in one place– Essentially, disk provides one-dimensional locality

• Need to store the bytes in some particular order– Embedding 2D structure in 1D space

• Two options:– Horizontal decomposition (“row-major” order)– Vertical decomposition (“column-major” order)

Horizontal vs. Vertical

• Horizontal decomposition– All attributes of a record

are stored together– Values for the same

attribute but different records are separated

– Standard DBMS storage model

• Vertical decomposition– All values of an attribute

are stored together– Values for the same record

but different attributes are separated

Col1 Col2 Col3

1 4 7

2 5 8

3 6 9

Col1 Col2 Col3

1 2 3

4 5 6

7 8 9

Horizontal

Vertical

Horizontal vs. Vertical

• Horizontal decomposition– Good locality when:

• Multiple attributes from same record are co-accessed• Not too many records are co-accessed

– For example: Inserts and deletes• Entire record is in one place• Perform insert or delete with a single I/O

• Vertical decomposition– Good locality when:

• Multiple values of the same attribute are co-accessed• Not too many attributes are co-accessed

– Frequently the case for OLAP queries

Vertical Decomposition Example

• SELECT customer_key, incomeFROM CustomerWHERE state = 'CA'AND age > 50– Subplan for OLAP query that filters on state, age and groups on income– Suppose Customer has 100 attributes

• Horizontally decomposed data:– Scan table, one record at a time– Ignore 96 attributes of each record– Use 4 attributes (state, age, income, customer_key)– Lots of wasted I/O bandwidth

• Vertically decomposed data:– Parallel scans of customer_key, income, state, and age columns– If ith state = ‘CA’ and ith age > 50, then output ith customer_key and ith

income– No redundant attributes are read from disk– I/O is kept to a minimum

Projection Index

• The idea behind projection indexes– Databases usually store data in horizontal format– Vertical format is more efficient for many analysis queries– Why not do both?

• Projection index– Logically: Index entries are <Value, RID> pairs– Stored in same order as records in relation

• i.e. sorted by RID instead of sorted by Value– In practice: Storing RID is unnecessary

• Array storage format• Array index determined from RID

• Incremental approach to vertical decomposition– Vertical decomposition = complete set of projection indexes– A few projection indexes = partial vertical decomposition

Using Projection Indexes

• Consider a star query via semi-join decomposition– For each dimension:

• Determine keys of dimension rows that satisfy filter conditions

• Join list of dimension keys to fact table (using index) • Result = list of fact RIDs

– Merge lists of fact RIDs from all dimensions– Retrieve matching fact rows– Join back to grouping dimensions– Perform grouping and aggregation

Using Projection Indexes

• “Retrieve matching fact rows” step can be expensive– All we care about are values of aggregated columns– Other attributes are irrelevant (e.g. dimension foreign

keys)– Poor packing of relevant information into disk pages

• Relevant and irrelevant data is clumped together

• Using a projection index is often faster– List of fact RIDs tells us which index entries to read– Many index entries packed into same disk page

• No “clutter” from irrelevant fields

– Fewer I/Os required

Bit-Sliced Indexes• Bit-sliced indexes

– Generally used for measurement columns– Allow for:

• Efficient aggregation• Efficient range filtering (particularly for large ranges)

– Most suitable for bitmap-based plans– Requires positive integer-valued column

• Fixed-precision decimals OK• Example: Interpret $5.67 as 567 cents

• Bit-sliced index on attribute A– Treat A as multiple logical binary-valued columns

• Column A1 = Least significant bit of A• Column A2 = 2nd least significant bit of A• Etc.• Number of logical columns determined by max value

– Store each column as a separate bitmap

Bit-Sliced Index Example

Amount

5

13

2

6

7

Binary

0101

1101

0010

0110

0111

B4: 01000B3: 11011B2: 00111B1: 11001

Bit-Sliced Index

Fast Bitmap Aggregation

• Take advantage of word-level parallelism– Implicit parallelism arising from SIMD

operations in modern computer architectures– SIMD: Single Instruction, Multiple Data– Processor can compute bitwise operations on

all bits in a word at the same time– Index intersection via bitmap merge takes

advantage of this fact– It’s one reason for byte-aligned compression

Fast Bitmap Count

• Count the number of 1’s in a bitmap– Treat the bitmap as a byte array– Pre-compute lookup table with number of 1’s in each byte– Cycle through bitmap one byte at a time, accumulating count

using lookup table

• Pseudocode– count = 0;for (int i = 0; i < n/8; i++)

count += numSetBits[bitmap[i]];– numSetBits[0] = 0, numSetBits[7] = 3, etc.

• Treating bitmap as short int array → even faster– Lookup table has 65536 entries instead of 256– Bitmap of n bits → only add n/16 numbers

SUM using Bit-Sliced Index

• Suppose Bf represents the foundset

– foundset = List of fact RIDs that pass all filters

• For each bit slice Bi:

– Compute Bf AND Bi

– Do fast bitmap count of resulting bitmap– Multiply count by 2i

• Total = sum of weighted counts for all slices

SUM Example

Amount

5

13

2

6

7

B4: 01000B3: 11011B2: 00111B1: 11001

Bit-Sliced Index

Count of B4: 1Count of B3: 4Count of B2: 3Count of B1: 3

1*24 + 4*23 + 3*22 + 3*21

= 8 + 16 + 6 + 3 = 33

5 + 13 + 2 + 6 + 7 = 33

Range Filtering with Bit-Sliced Indexes

• Bit-sliced indexes allow range filtering

• Cost of applying range predicate independent of size of range– Not true for bitmap indexes, B-Trees

• We’ll give the algorithm for “A < c”– A is the attribute that is indexed– c is some constant– Other operations (>, =, etc.) are similar

Pseudocode for “A < c”• Set BLT = all zeros. Set BEQ = all ones.• For each bit slice Bi, from most to least significant:

– If bit i of constant c is 1:• BLT = BLT OR (BEQ AND NOT(Bi))• BEQ = BEQ AND Bi

– If bit i of constant c is 0:• BEQ = BEQ AND NOT(Bi)

• Return BLT.

• Why does it work?• Invariant: BEQ = 1 for all rows that match c on the

most significant bits (and only those rows)• A value x is less than c iff for some bit i:

– x and c agree on all bits more significant than i– The ith bit of x is 0, and the ith bit of c is 1

“Amount < 7” ExampleAmount

5

13

2

6

7

B4: 01000B3: 11011B2: 00111B1: 11001

Bit-Sliced Index

7 = 0111

BLT BEQ

00000 11111000000010010100

10110

101111001100011

00001

Bit-Sliced vs. Projection Index

• Both benefit from vertical decomposition– Values for 1 column are packed onto as few pages as

possible

• Both allow for fast aggregation• Both allow range filtering• Bit-sliced indexes can be faster when:

– Attribute values don’t use full data type precision– Query is CPU-bound

• Fewer machine instructions due to SIMD parallelism

• Bit-sliced indexes are more complicated

Bit Vector Filtering

• Sometimes called “Bloom Join”– From “Bloom Filters” [Bloom 1970]

• Bloom filter = cheap, approximate semi-join– Store bitmap with 1 bit per hash bucket– Initialize bitmap to all zeros– For each record in relation A:

• Compute hash value of join key• Set the bit for the appropriate hash bucket to 1

• In distributed database, send bitmap to Server 2.• For each record in relation B:

– Compute hash value of join key– Check whether bit for the appropriate bucket is set

• Yes → Record might join to something in A• No → Record definitely doesn’t join to anything in A

• Send qualifying B records to Server 1 & compute join

Bit Vector Filtering• Also useful in non-distributed database• Applies to multi-pass hash or merge join• Hash join

– Partition relation A– Partition relation B– Join each A partition with matching B partition

• Merge join– Generate sorted runs for relation A– Generate sorted runs for relation B– Merge A’s runs, Merge B’s runs, Merge A & B

• Bit vector filtering– While pre-processing A, generate Bloom filter– While pre-processing B, discard records that don’t match filter– Significantly reduce size of B at low cost– Since A is already being scanned, generating the Bloom filter is “free”– Bloom filter can be made as small or large as memory permits

• Fewer buckets → more collisions → less effective filtering• Fewer buckets → less memory to store bitmap

Next week: Physical DB Design

• This concludes the query processing topic• Next week, we’ll begin physical database design

– Selection of indexes and materialized views– Partitioning and data layout– RAID and hardware considerations– Physical design trade-offs– Database tuning

• Note on course project:– Start thinking about topics– We’ll discuss the project in more detail next class.

Documents

CS 345: Topics in Data Warehousing Thursday, October 28, 2004