Data Warehouse implementation: Part B

Dr. Panagiotis Symeonidis

Data Engineering Laboratory

http://delab.csd.auth.gr/~symeon

Cuboids Materialization as an Optimization Problem

Minimize: the average time taken to evaluate a view

Constraint: materialize a fixed number k of views

Greedy algorithm: each choice is the best one given what has already been selected. It does not guarantee the optimal solution.

Example of lattice of views diagram

psc

pc ps sc

p s c

p: part, s: supp, c: cust

The lattice of views framework

If view V2 can be answered using the results of view V1, then

V2 is a descendant of V1, and V1 is an ancestor of V2

(denoted V2 ≼ V1)

E.g. (part) ≼ (part, cust)
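For views that differ only in their group-by attributes, the relation V2 ≼ V1 can be sketched as a subset test. This is a minimal sketch that assumes there are no dimension hierarchies; the function name `is_descendant` is illustrative and does not come from the slides.

```python
def is_descendant(v2, v1):
    """Return True if view v2 can be answered from view v1 (v2 ≼ v1).

    Assumption: a view is identified by its set of group-by attributes,
    so v2 is answerable from v1 when v2's attributes are a subset of v1's.
    """
    return set(v2) <= set(v1)


print(is_descendant({"part"}, {"part", "cust"}))          # True
print(is_descendant({"part", "supp"}, {"part", "cust"}))  # False
```

Every view is a descendant of itself under this test, which matches the reflexive reading of ≼.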

Some Definitions

k is the number of views to be materialized

C(v) is the cost of view v

Given:

v is a view

S is a set of views that are already selected to be materialized

The benefit of selecting v for materialization is

B(v, S) = C(S) – C(S ∪ {v})

where C(S) is the total cost of evaluating every view from its cheapest materialized ancestor in S

Greedy Algorithm

S ← {top view};

for i = 1 to k do

    select the view v not in S such that B(v, S) is maximized;

    S ← S ∪ {v};

return S
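The loop above can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the names `total_cost`, `benefit`, `greedy` and the `answers` mapping (view to the set of views it can answer, including itself) are assumptions, and the three-view chain used to exercise it is hypothetical.

```python
def total_cost(cost, answers, S):
    """C(S): each view is evaluated from its cheapest materialized ancestor in S."""
    return sum(min(cost[u] for u in S if w in answers[u]) for w in cost)


def benefit(cost, answers, v, S):
    """B(v, S) = C(S) - C(S U {v})."""
    return total_cost(cost, answers, S) - total_cost(cost, answers, S + [v])


def greedy(cost, answers, top, k):
    """Start from the top view and add k more views, largest benefit first."""
    S = [top]
    for _ in range(k):
        candidates = [v for v in cost if v not in S]
        S.append(max(candidates, key=lambda v: benefit(cost, answers, v, S)))
    return S


# Hypothetical three-view chain (not from the slides): top answers mid and low.
cost = {"top": 10, "mid": 4, "low": 1}
answers = {"top": {"top", "mid", "low"}, "mid": {"mid", "low"}, "low": {"low"}}
print(greedy(cost, answers, "top", 1))  # ['top', 'mid']
```

Here `mid` wins the first round because it lowers the cost of two views (10 to 4 each, benefit 12), while `low` only lowers its own cost (benefit 9). Ties are broken by dictionary order, which is an arbitrary choice of this sketch.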

1.1 Data Cube

Greedy selection on the part/supp/cust lattice, k = 2. View sizes (rows):

psc 6M

pc 6M   ps 0.8M   sc 6M

p 0.2M   s 0.01M   c 0.1M

The benefit of a view is the saving per view multiplied by the number of views that profit from it (in M).

1st choice (S = {psc})

View   Saving per view        Benefit
pc     6M-6M = 0              0 x 3 = 0
ps     6M-0.8M = 5.2M         5.2 x 3 = 15.6
sc     6M-6M = 0              0 x 3 = 0
p      6M-0.2M = 5.8M         5.8 x 1 = 5.8
s      6M-0.01M = 5.99M       5.99 x 1 = 5.99
c      6M-0.1M = 5.9M         5.9 x 1 = 5.9

ps has the largest benefit and is selected.

2nd choice (S = {psc, ps})

View   Saving per view        Benefit
pc     6M-6M = 0              0 x 2 = 0
sc     6M-6M = 0              0 x 2 = 0
p      0.8M-0.2M = 0.6M       0.6 x 1 = 0.6
s      0.8M-0.01M = 0.79M     0.79 x 1 = 0.79
c      6M-0.1M = 5.9M         5.9 x 1 = 5.9

c has the largest benefit and is selected.

The two views to be materialized are

1. ps

2. c

V = {ps, c}

Gain(V ∪ {top view}, {top view}) = 15.6 + 5.9 = 21.5
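The benefit figures of this example can be rechecked with a short script. It is a sketch under one assumption: a view can be answered from another when its attribute letters are a subset of the other's. The helper names are illustrative; the sizes (in millions of rows) are the ones given in the lattice.

```python
# View sizes in millions of rows, as given in the lattice.
size = {"psc": 6, "pc": 6, "ps": 0.8, "sc": 6, "p": 0.2, "s": 0.01, "c": 0.1}


def answers(u):
    """Views computable from u: attribute letters are a subset of u's (assumption)."""
    return [w for w in size if set(w) <= set(u)]


def total_cost(S):
    """C(S): each view is read from its smallest materialized ancestor in S."""
    return sum(min(size[u] for u in S if w in answers(u)) for w in size)


def benefits(S):
    """Benefit of every view not yet in S, rounded to hide float noise."""
    base = total_cost(S)
    return {v: round(base - total_cost(S + [v]), 2) for v in size if v not in S}


S, gain = ["psc"], 0.0
for step in (1, 2):
    table = benefits(S)
    best = max(table, key=table.get)
    print(f"choice {step}: {table} -> {best}")
    S.append(best)
    gain += table[best]
print(S, round(gain, 2))
```

The two rounds select ps and then c, and the accumulated gain is 15.6 + 5.9 = 21.5, in agreement with the slide.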

2nd Example of greedy algorithm

Initially, S = {a}; k = 4 (select 3 more)

Lattice of views, with the cost of each view:

a 100

b 50   c 75

d 20   e 30   f 40

g 1   h 10


First choice

b: 50 x 5 = 250
c: 25 x 5 = 125
d: 80 x 2 = 160
e: 70 x 3 = 210
f: 60 x 2 = 120
g: 99 x 1 = 99
h: 90 x 1 = 90

b has the largest benefit and is selected.


Second choice

c: 25 x 2 = 50
d: 30 x 2 = 60
e: 20 x 3 = 60
f: (100-40) x 1 + (50-40) x 1 = 60 + 10 = 70
g: 49 x 1 = 49
h: 40 x 1 = 40

f has the largest benefit and is selected.


Third choice

c: 25 x 1 = 25
d: 30 x 2 = 60
e: (50-30) x 2 + (40-30) x 1 = 20 x 2 + 10 x 1 = 50
g: 49 x 1 = 49
h: 30 x 1 = 30

d has the largest benefit and is selected.


If we materialize only a, the total cost is 8 x 100 = 800.

With a, b, f and d materialized, the total cost is 800 - 250 - 70 - 60 = 420.
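The totals 800 and 420 can be recomputed from the cost function C(S). The sketch below assumes the parent-child edges of the lattice, which are not spelled out in the text and are inferred here from the benefit multipliers (for example, b helps 5 views and f also lowers the cost of h); the helper names are illustrative.

```python
cost = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30, "f": 40, "g": 1, "h": 10}
# Assumed direct children of each view (inferred from the multipliers above).
children = {"a": "bc", "b": "de", "c": "ef", "d": "g", "e": "gh", "f": "h", "g": "", "h": ""}


def answers(u):
    """u itself plus every view reachable below it in the lattice."""
    seen = {u}
    for child in children[u]:
        seen |= answers(child)
    return seen


def total_cost(S):
    """C(S): each view is evaluated from its cheapest materialized ancestor in S."""
    return sum(min(cost[u] for u in S if w in answers(u)) for w in cost)


S = ["a"]
for _ in range(3):  # select 3 more views greedily
    S.append(max((v for v in cost if v not in S),
                 key=lambda v: total_cost(S) - total_cost(S + [v])))

print(total_cost(["a"]))  # 800
print(S, total_cost(S))   # ['a', 'b', 'f', 'd'] 420
```

Under the assumed edges the greedy picks are b, f and d, and the remaining cost of 420 matches 800 - 250 - 70 - 60.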

Performance Study

How badly can the Greedy Algorithm perform?

1.1 Data Cube

k = 2. View costs:

a 200

b 100   c 99   d 100

p1 … p20, q1 … q20, r1 … r20, s1 … s20: 97 each (four groups of 20 nodes)

1st choice (S = {a})

View   Saving per view     Benefit
b      200-100 = 100       41 x 100 = 4100
c      200-99 = 101        41 x 101 = 4141
d      200-100 = 100       41 x 100 = 4100

Greedy selects c. 2nd choice (S = {a, c})

View   Saving per view     Benefit
b      200-100 = 100       21 x 100 = 2100
d      200-100 = 100       21 x 100 = 2100

Greedy: V = {b, c}

Gain(V ∪ {top view}, {top view}) = 4141 + 2100 = 6241

If b is selected first instead, the 2nd choice is

View   Benefit
c      21 x 101 + 20 x 1 = 2141
d      41 x 100 = 4100

Optimal: V = {b, d}

Gain(V ∪ {top view}, {top view}) = 4100 + 4100 = 8200

Greedy / Optimal = 6241 / 8200 = 0.7611

If this ratio = 1, Greedy gives an optimal solution. If this ratio ≈ 0, Greedy may give a “bad” solution.

Does this ratio have a lower bound?

It is proved that this ratio is at least 0.63.
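The greedy and optimal gains of this lattice can be compared by brute force. The sketch assumes which intermediate views answer which 20-node group (p under b, q under b and c, r under c and d, s under d); this is inferred from the 41 and 21 multipliers and is not stated explicitly in the text. All names are illustrative.

```python
from itertools import combinations

cost = {"a": 200, "b": 100, "c": 99, "d": 100}
answers = {v: {v} for v in cost}
# Assumed parents of each 20-node group, inferred from the 41 / 21 multipliers.
group_parents = {"p": "b", "q": "bc", "r": "cd", "s": "d"}
for group, parents in group_parents.items():
    for i in range(1, 21):
        leaf = f"{group}{i}"
        cost[leaf] = 97
        answers[leaf] = {leaf}
        for parent in parents:
            answers[parent].add(leaf)
answers["a"] = set(cost)  # the top view answers every view


def total_cost(S):
    """C(S): each view is evaluated from its cheapest materialized ancestor in S."""
    return sum(min(cost[u] for u in S if w in answers[u]) for w in cost)


BASE = total_cost(["a"])


def gain(V):
    """Total saving of materializing the views in V in addition to a."""
    return BASE - total_cost(["a", *V])


# Greedy: add the view that maximizes the gain so far, twice (k = 2).
S = ["a"]
for _ in range(2):
    S.append(max((v for v in cost if v not in S), key=lambda v: gain(S[1:] + [v])))
greedy_gain = gain(S[1:])

# Optimal: try every pair of views.
optimal = max(combinations([v for v in cost if v != "a"], 2), key=gain)

print(S[1:], greedy_gain)                          # ['c', 'b'] 6241
print(optimal, gain(optimal))                      # ('b', 'd') 8200
print(round(greedy_gain / gain(optimal), 4))       # 0.7611
```

The exhaustive search is only feasible because k = 2 and the lattice is small; it is used here purely to confirm the ratio. The 0.63 bound quoted above corresponds to 1 − 1/e ≈ 0.632.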

Indexing OLAP Data: Bitmap Index

Relation table

Cust   Region    Type
C1     Asia      Retail
C2     Europe    Dealer
C3     Asia      Dealer
C4     America   Retail
C5     Europe    Dealer

Index on Region

RecID   Asia   Europe   America
1       1      0        0
2       0      1        0
3       1      0        0
4       0      0        1
5       0      1        0

Index on Type

RecID   Retail   Dealer
1       1        0
2       0        1
3       0        1
4       1        0
5       0        1
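A bitmap index like the one shown for Region and Type can be sketched with Python integers used as bit vectors. This is an illustrative sketch: the function name `build_bitmap_index` and the example selection (Region = Asia and Type = Dealer) are assumptions, while the five records come from the relation table.

```python
rows = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]


def build_bitmap_index(rows, column):
    """Map each distinct value of a column to an int bit vector (bit i = record i + 1)."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index


region = build_bitmap_index(rows, 1)
cust_type = build_bitmap_index(rows, 2)

# Region = Asia AND Type = Dealer becomes a single bitwise AND.
hits = region["Asia"] & cust_type["Dealer"]
matches = [rows[i][0] for i in range(len(rows)) if (hits >> i) & 1]
print(matches)  # ['C3']
```

Selections that combine several low-cardinality attributes reduce to bitwise operations on the vectors, which is the main reason bitmap indexes suit OLAP data.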

Determining which materialized cuboid(s) should be selected for OLAP operations

Query: find the total sales, grouped by {product-category, province}, with the condition “year = 2004”.

Which one of the following 4 materialized cuboids should be selected to process the query?

1) {year, product, city}

2) {year, product-category, country}

3) {year, product-category, province}

4) {product, province} where year = 2004

Solution:

Let the query to be processed be on {product-category, province} with the condition “year = 2004”, and let there be 4 materialized cuboids available:

1) {year, product, city}

– It can be used. However, it costs the most, because product and city are at a lower level than the query requires.

2) {year, product-category, country}

– It cannot be used, because country is a more general concept than province.

3) {year, product-category, province}

– It can be used. It could cost less than cuboid 4 if there are not many year values and there are many products for each product-category.

4) {product, province} where year = 2004

– It can be used.
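The usability test applied in this exercise can be sketched as a level comparison along each concept hierarchy. The hierarchies below (product to product-category, city to province to country) are an assumption read off the attribute names, and the dictionary layout and function names are illustrative.

```python
# Concept hierarchies, finest level first (assumed from the attribute names).
HIERARCHY = {
    "time": ["year"],
    "product": ["product", "product-category"],
    "location": ["city", "province", "country"],
}


def fine_enough(cuboid_levels, dim, level):
    """True if the cuboid keeps `dim` at `level` or at a finer level."""
    if dim not in cuboid_levels:
        return False
    order = HIERARCHY[dim]
    return order.index(cuboid_levels[dim]) <= order.index(level)


def can_answer(cuboid, query):
    """Check the query's group-by levels and conditions against one cuboid."""
    for dim, level in query["group_by"].items():
        if not fine_enough(cuboid["levels"], dim, level):
            return False
    for dim, condition in query["where"].items():
        if cuboid.get("where", {}).get(dim) == condition:
            continue  # the cuboid was already restricted to this condition
        if not fine_enough(cuboid["levels"], dim, condition[0]):
            return False
    return True


query = {
    "group_by": {"product": "product-category", "location": "province"},
    "where": {"time": ("year", 2004)},
}
cuboids = [
    {"levels": {"time": "year", "product": "product", "location": "city"}},
    {"levels": {"time": "year", "product": "product-category", "location": "country"}},
    {"levels": {"time": "year", "product": "product-category", "location": "province"}},
    {"levels": {"product": "product", "location": "province"},
     "where": {"time": ("year", 2004)}},
]
usable = [i + 1 for i, c in enumerate(cuboids) if can_answer(c, query)]
print(usable)  # [1, 3, 4]
```

The sketch only decides whether a cuboid can be used. Choosing the cheapest among the usable ones additionally depends on cuboid sizes, which the exercise discusses qualitatively.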

Iceberg queries

Assume we want to find pairs of customers and items such that the customer has purchased the item more than 5 times:

select P.custid, P.item, sum(P.qty)

from Purchases P

group by P.custid, P.item

having sum(P.qty) > 5

Execution plan for the query?

The number of groups is very large, but the answer to the query (the tip of the iceberg) is usually very small.

Iceberg queries

select P.custid, P.item, sum(P.qty)

from Purchases P

group by P.custid, P.item

having sum(P.qty) > 5

Q1:

select P.custid

from Purchases P

group by P.custid

having sum(P.qty) > 5

Q2:

select P.item

from Purchases P

group by P.item

having sum(P.qty) > 5

Generate (custid, item) pairs only for custid from Q1 and item from Q2.
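The pruning idea behind Q1 and Q2 can be sketched in Python. The sketch assumes that Q1 and Q2 keep only the customers and items whose own total quantity exceeds the threshold, and that quantities are non-negative (otherwise the pruning would not be safe). The function name and the sample rows are hypothetical.

```python
from collections import Counter


def iceberg_pairs(purchases, threshold=5):
    """Return {(custid, item): total qty} for pairs whose total exceeds threshold."""
    purchases = list(purchases)

    # Q1 and Q2: cheap one-dimensional aggregates.
    cust_total, item_total = Counter(), Counter()
    for custid, item, qty in purchases:
        cust_total[custid] += qty
        item_total[item] += qty
    q1 = {c for c, total in cust_total.items() if total > threshold}
    q2 = {i for i, total in item_total.items() if total > threshold}

    # Only pairs drawn from Q1 x Q2 get a counter.
    pair_total = Counter()
    for custid, item, qty in purchases:
        if custid in q1 and item in q2:
            pair_total[(custid, item)] += qty
    return {pair: total for pair, total in pair_total.items() if total > threshold}


# Hypothetical sample rows: (custid, item, qty).
sample = [("c1", "pen", 4), ("c1", "pen", 3), ("c1", "ink", 1),
          ("c2", "pen", 2), ("c3", "ink", 1)]
print(iceberg_pairs(sample))  # {('c1', 'pen'): 7}
```

A pair can only exceed the threshold if both its customer and its item do so on their own, so the second pass never has to keep counters for the many pairs that cannot qualify.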

From On-Line Analytical Processing (OLAP) to On-Line Analytical Mining (OLAM)

Why online analytical mining?

High quality of data in data warehouses

OLAP-based exploratory data analysis

Easy selection of data mining functions

April 21, 2023

An OLAM System Architecture

[Figure: OLAM system architecture, four layers from top to bottom]

Layer 4, User Interface: mining query in, mining result out

User GUI API

Layer 3, OLAP/OLAM: OLAM Engine, OLAP Engine

Data Cube API

Layer 2, MDDB: MDDB, Meta Data

Database API (filtering & integration)

Layer 1, Data Repository: Databases, Data Warehouse (data cleaning, data integration, filtering)
