89
Data Warehousing & Data Mining & Data Mining Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de

Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

  • Upload
    ngodiep

  • View
    217

  • Download
    1

Embed Size (px)

Citation preview

Page 1: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

Data Warehousing & Data Mining& Data MiningWolf-Tilo BalkeSilviu HomoceanuInstitut für InformationssystemeTechnische Universität Braunschweighttp://www.ifis.cs.tu-bs.de

Page 2: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

6. Optimization6.1 Multidimensional storage

6.2 DW Optimization / Indexes

Tree based indexes

Bitmap indexes

6. Optimization

Bitmap indexesHash Based

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 2

Page 3: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• The basic storage structure is the multidimensional array

– Customized based upon

• The data e.g., sparse or dense

6.1 Multidimensional Storage

• The data e.g., sparse or dense

• Characteristics of the secondary memory e.g., block- or page- oriented

– Cube data cells are stored sequentially

• Multidimensional cubes are linearized

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 3

Page 4: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Linearization– For n dimensions D1to Dn there are |D1| * |D2| *…* |Dn|

addressed cube cells

– E.g., 2D cube• |D1| = 5, |D2| = 4, cube cells = 20

6.1 Multidimensional Storage

11 22 53 64 175 Jan (1)• |D1| = 5, |D2| = 4, cube cells = 20

– To access a measure in a cube we gothrough dimensions

• Sold Jackets in February are stored incube cell D1[4], D2[3]

• After linearization D1[4], D2[3] becomes array cell 14– This represents (Index(D2) – 1) * |D1| + Index(D1) (2 full rows + the

remaining 4 elements from the incomplete row)

– Linearized Index = 2 * 5 + 4 = 14

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 4

11

36 47

22 53

78 89

64

911

1116 1217

1012 1313

1518 1619

1414

175

1910

2515

2720

Jan (1)

Feb(2)

Feb(3)

Feb(4)

D1

D2

Page 5: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• General idea of linearization

– Considering a cube C=((D1, D2, …, Dn), (M1:Type1, M2:Type2, …, Mm:Typem)), the index of a cube cell z with coordinates (x1, x2, …, xn) can be linearized as

6.1 Multidimensional Storage

with coordinates (x1, x2, …, xn) can be linearized as follows:

• Index(z) = x1 + (x2 - 1) * |D1| + (x3 - 1) * |D1| * |D2| + … + (xn - 1) * |D1| * … * |Dn-1| =

= 1+ ∑i=1

n ((xi - 1) * ∏j=1

i-1 |Di|)

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 5

Page 6: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Problems in array-storage– Influence of the order of the dimensions in the

cube definition• In the cube the cells of D2 are ordered

one under the other, e.g., Pants

6.1 Problems in Array-Storage

11 22 53 64 175 Jan (1)one under the other, e.g., PantsA query of sales of all pants involves a column in the cube

• After linearization, the information isspread among more data blocks or pages

• If we consider a data block can hold 5 cells, a query over all products sold in January can be answered with just 1 block read, but a query of all sold pants, involves reading 4 blocks

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 6

11

36 47

22 53

78 89

64

911

1116 1217

1012 1313

1518 1619

1414

175

1910

2515

2720

Jan (1)

Feb(2)

Feb(3)

Feb(4)

D1

D2

Page 7: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• The problem of dimensions order can be diminished by using caching solutions

– Caching and swapping is performed alsoby the operating system

6.1 Problems in Array-Storage

by the operating system

– MDBMS has to manage its cachessuch that the OS doesn’t perform anydamaging swaps

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 7

Page 8: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Storage of dense cubes

– If cubes are dense, array storage is more efficient. However, operations suffer due to the large cubes

– The solution is to store dense cubes not linear but on

6.1 Problems in Array-Storage

– The solution is to store dense cubes not linear but on 2 levels

• The first contains an indexes and the second the data cells stored in blocks

• Different optimization procedures like indexes (trees, bitmaps), physical partitioning, and compression (run-length-encoding) can be used

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 8

Page 9: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Storage of sparse cubes– All the cells of a cube, including empty ones, have to

be stored– Sparseness leads to data being stored in many physical

blocks or pages• The query speed is affected by the large number of block

6.1 Problems in Array-Storage

• The query speed is affected by the large number of block accesses on the secondary memory

– Solutions:• Do not store empty blocks or pages: if there are large empty

portions of the array, they will not be physically stored, but the index structure will be adapted

• 2 level data structure: upper layer holds all possible combinations of the sparse dimensions, lower layer holds dense dimensions

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 9

Page 10: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• 2 level cube storage

6.1 Problems in Array-Storage

Marketing campaign

Customer

Sparse upper level

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 10

Product

TimeGeo

Dense low level

Page 11: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Indexes are used to optimize queries. OLAP queries have an aggregation role– How many articles from product group washing

devices were sold in 2008 for each month in each region

6.2 Indexes

region• Very big detailed data set (lots of sales in the sales fact

table)

• Such aggregation (2008, region) queries on big data sets are costly: e.g., consider 100 GB of sales data stored in a star schema; for this query the whole set needs to be read…at an average speed of 40 MB/s it still takes 43 minutes only to read the data

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 11

Page 12: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Restriction role

– Besides aggregation, restrictions are the most time-consuming

– Typologies of restrictions based on query types

Range query

6.2 Remember OLAP Queries?

• Range query

– Restricted through intervals in each dimension

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 12

Pro

du

ct (

Art

icle

)

Time (Days)

Page 13: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Partial range query– Some dimensions are not restricted

– Geometrically described as a sub-space

• Partial match query– Restricts more dimensions on a point,

6.2 Remember OLAP Queries?

Pro

du

ct (

Art

icle

)

Time (Days)

Pro

du

ct (

Art

icle

)

– Restricts more dimensions on a point,while other dimensions remainunspecified

– Geometrically described as a hyper-level in

the data set

• Point query– Restricted to a point on all dimensions

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 13

Pro

du

ct (

Art

icle

)

Time (Days)

Pro

du

ct (

Art

icle

)

Time (Days)

Page 14: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

6.2 Optimization Procedures

R1 R2

MV1

Base-layer of

the data

Layer of

materializationMV2

Logical

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 14

Layer of

partitioning

Layer of

Index struct.

P1

P2

P3

B*-Baum

kB-Baum

R*-Baum

UB-Baum

Logical

access paths

Physical

access paths

Page 15: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Why index?– Consider a 10 GB table; at 10 MB/s read speed we

need 17 minutes for a full table scan

– Consider an OLAP query: the number of Bosch S500 washing machines sold in Braunschweig last month?

6.2 Index Structures

washing machines sold in Braunschweig last month?• Applying restrictions (product, location) the selectivity

would be strongly reduced– If we have 30 location, 10000 products and 24 months

in the DW, the selectivity is

1/30 * 1/ 10000 * 1/24 = 0,00000014

– So…we read 10 GB for 1,4KB of data…not very smart

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 15

Page 16: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Reduce the size of read pages to a

minimum with indexes

6.2 Index Structures

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 16

Pro

du

ct (

Art

icle

)

Time (Days)

Full table scan

Pro

du

ct (

Art

icle

)

Time (Days)

Cluster primary

indexP

rod

uct

(A

rtic

le)

Time (Days)

More secondary

Indexes, bitmap indexes

Pro

du

ct (

Art

icle

)

Time (Days)

Optimal multi-

dimensional index

Scanned data Selected data

Page 17: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Types of queries

– k-nearest-neighbor-Search (k-NN-Search)

• Find the first K objects with the smallest distance to the query object

• Such a query is usually performed with approximations

6.2 Index Structures

• Such a query is usually performed with approximations (with an error rate)

– Reverse-nearest-neighbor-Search

• Find all the objects whose nearest neighbor is the queried object

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 17

Page 18: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Curse of Dimensionality (Richard Bellman)

– The volume increases exponentially with the dimensions number

– More dimensions mean more comparisons will be performed

6.2 Index Structures

performed

• At the present time there is no real scalable indexing

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 18

Page 19: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Classification criteria for index structures– Clustering

• Data with high probability of being used together

– Dimensionality• Refers to the number of attributes used to calculate the index

key

6.2 Index Structures

key

– Symmetry• The order of the index attribute is not performance relevant

– Tuple identifier (TID)• TID s are position numbers pointing to the physical storage place

of the corresponding data

– Dynamical behavior• Effort needed for dynamical modifications can strongly vary from

index to index

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 19

Page 20: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• In the beginning…there were

B-Trees

– Data structures for storing sorted

data with amortized run times for

6.2 Trees

data with amortized run times for

insertion and deletion

– Basic structure of a node

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 20

Tree Node

Node Pointers

Key Value Data Pointer

Page 21: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• B-Trees as primary indexes in OLTP

6.2 Trees

2 6 7

1 3 4 5 8 9

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 21

1,

Ad

am

s, $

88

7,0

0

2,

Be

rtra

m,

$1

9,9

9

3,

Be

ha

im,

$ 1

67

,00

4,

Ce

sar,

$ 1

86

6,0

0

5,

Mil

ler,

$1

79

,99

6,

Na

de

rs,

$ 6

82

,56

7,

Ru

th,

$ 8

64

2,7

8

8,

Sm

ith

, $

67

5,9

9

9,

Tarr

en

s, $

99

,00

Page 22: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• What’s wrong with B-Trees?

– K-D-B trees are useful for point data only

• Exact-point lookup!

• Not good for storing geometrical data and multi-dimensional data

6.2 B-Trees

dimensional data

– The R-Tree provided a way to do that (thanks to Guttman ‘84)

• R-tree represents data objects in intervals in several dimensions

– Exact-point and range lookups!

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 22

Page 23: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Demands

– Can store d-dimensional hypercubes

– Performs point, line, and box queries as fast as possible...

6.2 R-Trees

– ...but also keeps memory usage in check!

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 23

Page 24: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• R-Trees are recommended for lower dimensionality

– Up to 10 dimensions

• More scalable variants:

6.2 R-Trees

• More scalable variants:

– R+-Trees, R*-Trees und X-Trees

– Each up to 20 dimensions

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 24

Page 25: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• There is no total ordering of objects in the multidimensional space that preserves spatial proximity

– E.g. in time dimension it makes sense to keep data belonging to consecutive quarters clustered together

6.2 R-Trees

belonging to consecutive quarters clustered together

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 25

Page 26: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• R-Trees

– Can organize any-dimensional data

• Representing the data by a minimum bounding rectangle (MBR)

– Each node bounds it’s children

6.2 R-Trees

– Each node bounds it’s children

– A node can have many objects in it

• E.g.,

– node capacity of 3 (in praxis node

capacity is of 100s)

– 2 dimensions

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 26

Page 27: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• The leaves point to the actual objects (stored on disk probably)

• The height is always log n (it is height balanced)

6.2 R-Trees

Root

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 27

R1 R2

R3 R4 R5 R6

R1

R4

R3

R5

R6R2

R3.1 R3.2 R3.3 R4.1 R4.2 R4.3

R4.1

Points to data tuplesleafs

Non-leafs

Root

Page 28: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• 2 types of nodes

– Non-leaf nodes, contain entries of the form (I, P)

• I represents a vector I=(I1, I2, …, In) describing the boundaries of the element on the n dimensions, e.g., time and location, for R1 we have I=([08 Qtr1, 09 Qtr1],

6.2 R-Trees

and location, for R1 we have I=([08 Qtr1, 09 Qtr1], [a, b])

• P represents a pointer(e.g., disk page address)to the node with itschildren

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 28

R1 R2

R3 R4

R1

R4

R3

R5

R6R2

08 Qtr1

08 Qtr2

08 Qtr3

08 Qtr4

09 Qtr1

Time

Locationa b c d e f g

Page 29: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Leaf nodes, contain entries of the form (I, RID)

– I same as for non-leaf

– RID represents an unique tupleidentifier

6.2 R-Trees

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 29

R3.1 R3.2 R3.3

Points to data tuples

Page 30: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Operations

– Let:

• EI be the rectangle part of an index entry E

• Ep be the tupel-identifier or pointer

• S be the search rectangle

6.2 R-Trees - Search

• S be the search rectangle

– E.g., ([08 Qtr3, 09 Qtr1], [a, b])

• T be the root of the R-Tree

– Search:

• Start from the root node

• If multiple sub-trees contain the point of interest then follow all

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 30

Page 31: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Search (T, S)– If T is not a leaf

• Check each entry E to determine whether EI overlaps S

• For all overlapping entries, invoke Search(Ep, S)

– If T is a leaf

6.2 R-Trees - Search

If T is a leaf• Check all entries E to determine whether EI overlaps S

– If so, E is a qualifying record

• No good performance guarantees– In worst case all paths must be searched (due to

overlapping)

• Search algorithms try to cut out irrelevant regions („Pruning“)

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 31

Page 32: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Search on sales on last 2 quarters of 08, locations e and f

6.2 R-Trees - Search

R1

R3

R5

08 Qtr3

08 Qtr4

09 Qtr1

Time

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 32

R1

R3 R4

R3.1 R3.2 R3.3 R4.1 R4.2 R4.3

R4

R6R2

08 Qtr1

08 Qtr2

08 Qtr3

a b c d e f g

R5.3 R6.3R6.2

Page 33: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Insert, general idea

– New index records are added to the leaves!

• Nodes that overflow are split

• Splits propagate up the tree

• Node splitting is not trivial

6.2 R-Trees - Insert

• Node splitting is not trivial

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 33

Page 34: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Insert (T, E)– Find position for new record

• Invoke ChooseLeaf to select a leaf node L in which to place E

– Add record to leaf node: • If L has room for E then insert E and return

6.2 R-Trees - Insert

If L has room for E then insert E and return

• Otherwise, invoke SplitNode to obtain L and LL containing E and all the old entries of L

– Propagate changes upwards• Invoke AdjustTree on L, also passing LL if a split was performed

– Grow tree taller• If node split propagation caused the root to split, create a new

root whose children are the two resulting nodes.

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 34

Page 35: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• ChooseLeaf(E)– Initialize

• Set N to be the root node

– Leaf check• If N is a leaf, return N

6.2 R-Trees

If N is a leaf, return N

– Choose sub-tree• Let F be the entry in N whose rectangle FI needs least

enlargement to include E

• Resolve ties by choosing the entry with the rectangle of smallest area

– Descend until a leaf is reached• Set N to be the child node pointed to by Fp and repeat from Leaf

check

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 35

Page 36: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Node splitting

– A full node contains M entries

– Divide the collection of M+1 entries between

2 nodes.

6.2 R-Trees - Insert

2 nodes.

– Objective: Make it as unlikely as possible for the resulting two new nodes to be examined on subsequent searches.

– Heuristic: The total area of two covering rectangles after a split should be minimized

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 36

Page 37: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Node splitting

6.2 R-Trees - Insert

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 37

Bad split Good split

Page 38: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Node splitting methods

– Exhaustive algorithm

• Generate all possible groups and choose the best with minimum area

• Number of possibilities ~ 2M-1

6.2 R-Trees - Insert

• Number of possibilities ~ 2M-1

• For M ~ 50 Number of possibilities ~ 600 Trillion

– Which is even more than theObama administration spentwith the crisis!!!

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 38

Page 39: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Quadratic-cost algorithm

– A heuristic to find a small-area split

– Cost is quadratic in M and linear in the number of dimensions

6.2 R-Trees - Insert

– Pick two of the M+1 entries to be the first elements of the two new groups

• Calculate the MBR for each pair, and choose the one with the largest MBR

• These 2 objects are the new starting points for the resulting 2 nodes

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 39

Page 40: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Quadratic-cost algorithm

– Assign remaining entries to groups one at a time

• Calculate d1 as the necessary volume difference to include the current entry in MBR1 and d2 for MBR2

• Calculate d1 and d2 for all the remaining entries

6.2 R-Trees - Insert

• Calculate d1 and d2 for all the remaining entries

• Insert the entry with the highest preference into the corresponding node

– Preference = max(d1, d2) – min(d1, d2)

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 40

Page 41: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Linear cost algorithm

– Identical to Quadratic with the following differences:

• Uses a linear procedure to identify the starting entries

– Find in each dimension 2 rectangles

» the rectangle with the highest minimum coordinates

6.2 R-Trees - Insert

» the rectangle with the highest minimum coordinates

» and the rectangle with the lowest maximum coordinates

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 41

C

AD

B E

On X:

highest minimum coordinates -> E

lowest maximum coordinates -> A

On Y:

highest minimum coordinates -> D

lowest maximum coordinates -> C

Page 42: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

– Calculate the difference on the corresponding dimension between the coordinates of corresponding rectangles, and normalize it by the maximum on that dimension

– The two starting points are the two entries with the highest normalized difference

6.2 R-Trees - Insert

Dx = (Ex – Ax)/max(x)

– Order the next entries so that the volume growth is the smallest from one step to another

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 42

C

AD

B E

Dx = (Ex – Ax)/max(x)

Dy = (Dy - Cy)/max(y)

If (Dx > Dy) then choose E and A as starting points

Else choose D and C

Page 43: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Quadratic vs. Linear cost algorithm

– Quadratic:

• Choose two objects that create as much empty space as possible

– Linear:

6.2 R-Trees - Insert

– Linear:

• Choose two objects that are furthest apart

– Linear node-split is simple, fast, and as good as quadratic!

– Quality of the splits is slightly worse!

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 43

Page 44: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Delete entry

– Start a normal search of the entry to delete (FindLeaf)

– Delete the record from the leaf (DeleteRecord)

– Condense Tree if needed (if there are now nodes

6.2 R-Trees - Delete

Condense Tree if needed (if there are now nodes which only have few entries)

• At condensation the node to be condensed is deleted as a whole and the entries which should remain are then inserted

• If the root has just one child, it will be the new root

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 44

Page 45: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Update

– If the datasets are updated the existent rectangles can be changed

– In this case the index entry must be deleted updated and inserted

6.2 R-Trees - Update

and inserted

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 45

Page 46: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Let

– M = Maximum number of entries in a node.

– m <= M/2

– N = Number of records

• Properties

6.2 R-Trees

• Properties

– Every leaf node contains between m and M index records

• Root node is the exception

– For each index record (I, RID) in a leaf node, I is the smallest rectangle that spatially contains the n dimensional data object represented in the indicated tuple

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 46

Page 47: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

– Every non-leaf node has between m and M children• Root node is the exception

– For each entry (I, P) in a non-leaf node, I is the smallest rectangle that spatially contains the rectangles in the child node

6.2 R-Trees

in the child node

– The root node has at least two children unless it is a leaf

– All leaves appear on the same level

– Height of a tree = ceiling(logm N)-1

– Worst case utilization for all nodes except the root is m/M

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 47

Page 48: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Advantages

– Efficient for non-point queries

– No downward cascading splits

– Guaranteed utilization

6.2 R-Trees

Guaranteed utilization

• Disadvantages

– Dimension dependent fan-out

– Overlapping regions - search performance problem

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 48

Page 49: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• R+-Trees enhances retrieval performance by avoiding visiting multiple pathswhen searching for point queries

– No overlap for MBRs at the samelevel (internal nodes)

6.2 R-Tree Variations

level (internal nodes)

– Specific object’s entry might beduplicated

– Insertions might lead to a series of update operations in a chain-reaction

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 49

Page 50: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Compared to R-trees,

– Nodes are not guaranteed to be at least half filled

– The entries of any internal node do not overlap

– An object ID may be stored in more than one leaf

6.2 R-Tree Variations

An object ID may be stored in more than one leaf node

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 50

A

B

C A, B and C do not overlap

Page 51: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• R+-Trees

– Advantages

• Because nodes are not overlapped with each other, point query performance benefits, e.g., a single path is followedand fewer nodes are visited than with the R-tree

6.2 R-Tree Variations

and fewer nodes are visited than with the R-tree

– Disadvantages

• Since rectangles are duplicated, an R+-Tree can be larger than an R-Tree built on same data set

• Construction and maintenance of R+ trees is more complex than the construction and maintenance of R-Trees

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 51

Page 52: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• R*-Trees

– Node split is more sophisticated

• When a node overflows, p entries are extracted and reinserted in the tree (p might be 25%)

– Considers minimization of:

6.2 R-Tree Variations

– Considers minimization of:

• Overlapping between minimum bounding rectangles at the same level

• Perimeter of the produced minimum bounding rectangles

– Insertion is more expensive while retrievals are faster

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 52

Page 53: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Combination of B*-Tree and Z-curve = Universal B-Tree (UB-tree)

• Z-curve is used to map multidimensional points to one-dimensional values (Z-values)

• Z-values are used as keys in B*-Tree

6.2 UB-Trees

• Z-values are used as keys in B*-Tree

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 53

Data part

8 178 17 39 5139 51

2828Index part

Page 54: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Concept of Z-Regions

– To create a disjunctive partitioning of the multidimensional space

– This allows for very efficient processing of multidimensional range queries

6.2 UB-Trees

multidimensional range queries

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 54

1

3 4

2 5

7 8

6

9

11 12

10 13

15 16

14

17

19 20

18 21

23 24

22

25

27 28

26 29

31 32

30

33

35 36

34 37

39 40

38

41

43 44

42 45

47 48

46

49

51 52

50 53

55 56

54

57

59 60

58 61

63 64

62

0

1

2

3

4

5

6

7

0 1 2 3 4 5 6 7

Page 55: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Z-Regions

– The space covered by an interval on the Z-Curve

– Defined by two Z-Addresses a and b

• We call b the region address of [a : b]

6.2 UB-Trees

– Each Z-Region maps exactly onto one page on secondary storage

• I.e., to one leaf page of the B*-Tree

– E.g., of Z-Regions

• [1:9], [10, 18], [19, 28]…

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 55

1

3 4

2 5

7 8

6

9

11 12

10 13

15 16

14

17

19 20

18 21

23 24

22

25

27 28

26 29

31 32

30

33

35 36

34 37

39 40

38

41

43 44

42 45

47 48

46

49

51 52

50 53

55 56

54

57

59 60

58 61

63 64

62

Page 56: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Z-Value address representation

– Calculated through bit interleaving of the coordinates of the tuple

6.2 UB-Trees

Tuple = 51, x = 4, y = 5

xX = 4 = 100Y = 5 = 101

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 56

1

3 4

2 5

7 8

6

9

11 12

10 13

15 16

14

17

19 20

18 21

23 24

22

25

27 28

26 29

31 32

30

33

35 36

34 37

39 40

38

41

43 44

42 45

47 48

46

49

51 52

50 53

55 56

54

57

59 60

58 61

63 64

62

0

1

2

3

4

5

6

7

0 1 2 3 4 5 6 7

y

xX = 4 = 100Y = 5 = 101

Z-value = 110010

Page 57: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Why Z-Values?

– With Z-Values we reduce the dimensionality of the data to one dimension

– Z-Values are then used as keys in B*-trees

Using B -Trees results in high node filling degree (at least

6.2 UB-Trees

• Using B*-Trees results in high node filling degree (at least 50%)

• Logarithmical complexity at search, insert and delete

– Guaranteed maximum node-accesses to locate a key is

– Z-Values are very important for range queries!

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 57

� log��� −��� (� + 12 )�

Page 58: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Range queries (RQ) in UB-Trees

– Each query can be specified by 2 coordinates

• qa (the upper left corner of the query rectangle)

• qb (the lower right corner of the query rectangle)

– RQ-algorithm

6.2 UB-Trees

– RQ-algorithm

1. Starts with qa and calculates

its Z-Region

1. Z-Region of qa is [10:18]

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 58

1

3 4

2 5

7 8

6

9

11 12

10 13

15 16

14

17

19 20

18 21

23 24

22

25

27 28

26 29

31 32

30

33

35 36

34 37

39 40

38

41

43 44

42 45

47 48

46

49

51 52

50 53

55 56

54

57

59 60

58 61

63 64

62

Page 59: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Range queries (RQ) in UB-Trees2. The corresponding page is loaded and filtered with the

query predicate

1. Tupels 15 and 16 fulfill the predicate

3. The next region (inside the query rectangle) on the Z-

6.2 UB-Trees

3. The next region (inside the query rectangle) on the Z-curve is calculated

1. The next jump point on the Z-curve is 27

4. Repeat steps 2 and 3 until the

end-address of the last filtered

region is bigger than qb

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 59

1

3 4

2 5

7 8

6

9

11 12

10 13

15 16

14

17

19 20

18 21

23 24

22

25

27 28

26 29

31 32

30

33

35 36

34 37

39 40

38

41

43 44

42 45

47 48

46

49

51 52

50 53

55 56

54

57

59 60

58 61

63 64

62

Page 60: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• The critical part of the algorithm is calculating the jump point on the Z-curve which is inside the query rectangle

– If this takes too long it eliminates the advantage obtained through optimized disk access

6.2 UB-Trees

obtained through optimized disk access

– How is the jump point optimally calculated?

• From 3 points: qa, qb and the current Z-Region

• By performing bit operations

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 60

Page 61: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Range Queries, UB-Trees and DW

– In DW we have hierarchical organization of dimensions

– No intervals for hierarchical restrictions

6.2 UB-Trees

– Naive restrictions lead to many point queries instead of one interval on UB-Tree

– This is why we need Multidimensional Hierarchical Clustering (MHC)

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 61

Page 62: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• With MHC UB-Trees can:

– Artificial encode hierarchies:

• Mapping of hierarchy restrictions to range restrictions

• Mapping is used for physical clustering of the fact table

– Increase computation and space efficiency

6.2 MHC

– Increase computation and space efficiency

• However, modification of query algorithms is necessary

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 62

Page 63: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

6.2 MHC

Category

Sector

VideoAudio

Brown Goods White Goods

ALL

...

Category

Sector

VideoAudio VideoAudio

Brown Goods White GoodsBrown Goods White Goods

ALL

...

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 63

Item

Product

GroupCamcorder VCR

TR-780 TRV-30 GR-AX 200 GV-500 SLV-E800

...

ID 2 11 5 8 21

Item

Product

GroupCamcorder VCR

TR-780 TRV-30TR-780 TRV-30 GR-AX 200 GV-500 SLV-E800

...

ID 2 11 5 8 21

Page 64: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

6.2 MHC

...Category

Sector

VideoAudio

Brown Goods White Goods

ALL

...

...

0 1

00 01...Category

Sector

VideoAudio VideoAudio

Brown Goods White GoodsBrown Goods White Goods

ALL

...

...

0 1

00 01

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 64

Item

ProductGroup

Camcorder VCR

TR-780 TRV-30 GR-AX 200 GV-500 SLV-E800

...

ID 2 11 5 8 21

0 1

Surrogate 00100000

0000 0001 0000 0001 0010

00100001 00110000 00110001 00110010

32 33 48 49 50

Item

ProductGroup

Camcorder VCR

TR-780 TRV-30TR-780 TRV-30 GR-AX 200 GV-500 SLV-E800

...

ID 2 11 5 8 21

0 1

Surrogate 00100000

0000 0001 0000 0001 0010

00100001 00110000 00110001 00110010

32 33 48 49 50

Page 65: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Bitmap Indexes

– Lets assume a relation Expenses

with three attributes: Nr, Shop

and Sum

6.2 Bitmap Indexes

Nr Shop Sum

1 Saturn 150

2 Real 65

3 P&C 160

4 Real 45and Sum

– A bitmap index for attribute

Shop looks like this

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 65

4 Real 45

5 Saturn 350

6 Real 80

Value Vector

P&C 001000

Real 010101

Saturn 100010

Page 66: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• A bitmap index for an attribute of relation is:– A collection of bit-vectors

– The number of bit-vectors represents the number of distinct values of the attribute in the relation

– The length of each bit-vector is called the

6.2 Bitmap Indexes

– The length of each bit-vector is called the cardinality of the relation

– The bit-vector for value v has 1 in position i, if the ith

record has v in attribute A, and it has 0 otherwise

• Records are allocated permanent numbers

• There is a mapping between record numbers and record addresses

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 66

Page 67: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Advantages

– Very efficient when used for partial match queries

– They offer the advantage of buckets

• In our example each index vector is a bucket

E.g., the Saturn bitmap vector is a bucket

6.2 Bitmap Indexes

– E.g., the Saturn bitmap vector is a bucket

of 2, telling us that records having value

Saturn in attribute Shop are first and

5th record in the table

– They can also help answer range queries

– Efficient hardware support for bitmap operations (AND, OR, XOR, NOT)

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 67

Value Vector

P&C 001000

Real 010101

Saturn 100010

Page 68: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Assume:– There are n records in the table

– Attribute A has m distinct values in the table

• The size of a bitmap index on attribute A is m*n• Significant number of 0’s is m is big, and of 1’s if m

6.2 Bitmap Indexes

m*n• Significant number of 0’s is m is big, and of 1’s if m

is small– Opportunity to compress

• Run Length Encoding (RLE)

• Gzip (Lempel-Ziv, LZ)

• Byte-Aligned Bitmap Compression (BBC): variable byte length encoding (Oracle patent)

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 68

Page 69: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Handling modification

– Assume record numbers are

not changed

– Deletion

6.2 Bitmap Indexes

Nr Shop Sum

1 Saturn 150

2 Real 65

3 P&C 160

4 Real 45

5 Saturn 350Deletion

• Tombstone replaces deleted record

• Corresponding bit is set to 0

• E.g. delete the 5th record

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 69

6 Real 80

Value Vector

P&C 001000

Real 010101

Saturn 100010

Value Vector

P&C 001000

Real 010101

Saturn 100000

Before After

Page 70: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Insertion record is assigned thenext record number

– A bit of value 0 or 1 is appendedto each bit vector

– If new record contains a new value

6.2 Bitmap Indexes

Nr Shop Sum

1 Saturn 150

2 Real 65

3 P&C 160

4 Real 45

5 Saturn 350

– If new record contains a new valueof the attribute, add one bit-vector

• E.g., insert new record with REWE as shop

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 70

6 Real 80

7 REWE 23

Value Vector

P&C 001000

Real 010101

Saturn 100010

Value Vector

P&C 0010000

Real 0101010

Saturn 1000100

REWE 0000001

BeforeAfter

Page 71: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Modification

– Change the bit corresponding tothe old value of the modified record to 0

– Change the bit corresponding tothe new value of the modified record to 1

6.2 Bitmap IndexesNr Shop Sum

1 Saturn 150

2 REWE 65

3 P&C 160

4 Real 45

5 Saturn 350

6 Real 80the new value of the modified record to 1

– If the new value is a new value of A, then insert a new bit-vector: e.g., replace Shop for record 2 to REWE

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 71

6 Real 80

Value Vector

P&C 001000

Real 010101

Saturn 100010

Value Vector

P&C 001000

Real 000101

Saturn 100010

REWE 010000

BeforeAfter

Page 72: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Select

– Basic AND, OR bit operations:

• E.g., select the sums we have spent inSaturn and P&C

6.2 Bitmap IndexesNr Shop Sum

1 Saturn 150

2 Real 65

3 P&C 160

4 Real 45

5 Saturn 350

6 Real 80Saturn OR P&C = Result

– Bitmap indexes should be used when selectivity is high

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 72

6 Real 80

Value Vector

P&C 001000

Real 010101

Saturn 100010

Saturn OR P&C = Result

1 0 1

0 0 0

0 1 1

0 0 0

1 0 1

0 0 0

Page 73: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Advantages– Operations are efficient and easy to implement (directly

supported by hardware)

• Disadvantages– For each new value of an attribute a new bitmap-vector is

6.2 Bitmap indexes

– For each new value of an attribute a new bitmap-vector is introduced

• If we bitmap index an attribute like birthday (only day) we have 365 vectors: 365/8 bits ≈ 46 Bytes for a record, just for that

• Solution to such problems is multi-component bitmaps

– Not fit for range queries where many bitmap vectors have to be read

• Solution: range-encoded bitmap indexes

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 73

Page 74: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Multi-component bitmap indexes

– Encoding using a different numeration system

• E.g., for the month attribute, between 0 and 11 values can be encoded as x = 4 *y+z, where 0 ≤ y ≤2, and 0 ≤z ≤3, called <3,4> basis encoding

6.2 Bitmap indexes

x = 4 *y+z, where 0 ≤ y ≤2, and 0 ≤z ≤3, called <3,4> basis encoding

• 9 = 4*2+1

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 74

Month Dec Nov Oct Sep Aug Jul Jun Mai Apr Mar Feb Jan

M A11 A10 A9 A8 A7 A6 A5 A4 A3 A2 A1 A0

9 0 0 1 0 0 0 0 0 0 0 0 0

X Y Z

M A2,1 A1,1 A0,1 A3,0 A2,0 A1,0 A0,0

9 1 0 0 0 0 1 0

Page 75: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Advantage of multi-component bitmap indexes

– If we have 100 (0..99) different Days to index we can use a multi-component bitmap index with basis of <10,10>

6.2 Multi-component bitmap indexes

<10,10>

– The storage is reduced from 100 to 20 bitmap-vectors (10 for y and 10 for z)

– The read-access for a point (1 day out of 100) query needs however 2 read operations instead of just 1

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 75

Page 76: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Range-encoded bitmap indexes– Idea: set the bits of all bitmap vectors to 1 if they are

higher or equal to the given value

6.2 Bitmap indexes

Month Dec Nov Oct Sep Aug Jul Jun Mai Apr Mar Feb Jan

M A11 A10 A9 A8 A7 A6 A5 A4 A3 A2 A1 A0

0 1 1 1 1 1 1 1 1 1 1 1 1

– Query people born between March and August• For normal encoded bitmap indexes read 6 vectors, for range-

encoded indexes, we can solve the query with just 2 vectors read: ((NOT A2) AND A7)

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 76

0 1 1 1 1 1 1 1 1 1 1 1 1

3 1 1 1 1 1 1 1 1 1 0 0 0

5 1 1 1 1 1 1 1 0 0 0 0 0

11 1 0 0 0 0 0 0 0 0 0 0 0

Page 77: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• If the query is limited only on one side, (e.g., persons born in or after March), 1vector is enough (NOT A1)

• For point queries, 2 vector reads are however

((NOT A ) AND A )

6.2 Range-encoded bitmap indexes

• For point queries, 2 vector reads are however necessary!

– E.g., persons born in March: ((NOT A1) AND A2)

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 77

Page 78: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Combine multi-component bitmap indexes with range-encoding bitmap indexes and we have multi-component-range-encoding bitmap indexes

6.2 Bitmap indexes flavors

• Interval-encoded bitmap indexes

– Each bitmap-vector represents an interval

– It also needs to read at most 2 vectors, but the storage is half, compared to range-encoded bitmap indexes

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 78

Page 79: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Partitions the range of key values for each key into several buckets

• Dynamic structure using a grid directory

– Grid array: a 2 dimensional array with pointers to

6.2 Grid Files

– Grid array: a 2 dimensional array with pointers to buckets (this array can be large, disk resident) G(0,…, nx-1, 0, …, ny-1)

– Linear scales: two 1 dimensional arrays that used to access the grid array (main memory) X(0, …, nx-1), Y(0, …, ny-1)

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 79

Page 80: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

6.2 Grid Files

XY

0 6 25 32

A

F

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 80

K

O

Z

Page 81: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Properties

– Supports multi-dimensional data, but not high number of dimension

– Every key is treated as primary key

6.2 Grid Files

– The index structure adapts itself dynamically to maintain storage efficiency

– Guarantee two disk accesses for point queries

– Values of key must be in linearly-ordered domain

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 81

Page 82: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Exact Match Search: at most 2 I/Os assuming linear scales fit in memory– First use liner scales to determine the index into the cell

directory– Access the cell directory to retrieve the bucket address

(may cause 1 I/O if cell directory does not fit in memory)

6.2 Grid Files: Queries

(may cause 1 I/O if cell directory does not fit in memory)– Access the appropriate bucket (1 I/O)

• Range Queries:– Use linear scales to determine the index into the cell

directory– Access the cell directory to retrieve the bucket addresses

of buckets to visit– Access the buckets

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 82

Page 83: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• E.g., Find 5<X<9 AND “Mat”<Y<“Robot”

6.2 Grid Files: Range queries

XY

0 6 25 32

A

F

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 83

F

K

O

Z

Page 84: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Determine the bucket into which insertion must occur

– If space in bucket, insert

– Else, split bucket

6.2 Grid Files: Insert

– If bucket split causes a cell directory to split do so and adjust linear scales

• Insertion of these new entries potentially requires a complete reorganization of the cell directory… expensive!!!

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 84

Page 85: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

6.2 Grid Files: Insert

a) b)

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 85

a) b)

c) d)

Page 86: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

6.2 Grid Files: Insert

d) e)

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 86

d) e)

f) g)

Page 87: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Delete the data node, and

– Merge data pages/blocks ifpossible

– Merge directory pages ifpossible

6.2 Grid Files: Delete

a)possible

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 87

a)

b) c)

Page 88: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

6.2 Grid Files: Delete

c) d)

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 88

c) d)

e) f)

Page 89: Data Warehousing & Data Mining - TU · PDF file• Why index? – Consider a 10 GB ... – If we have 30 location, 10000 products and 24 months ... More secondary Indexes, bitmap indexes

• Optimization

– Partitioning

– Joins

– Materialized Views

Next lecture

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 89