Dr. Panagiotis Symeonidis

Data Engineering Laboratory

http://delab.csd.auth.gr/~symeon

Data Warehouse implementation: Part B


Cuboids Materialization as an Optimization Problem

Minimize: the average time taken to evaluate a view

Constraint: materialize a fixed number k of views

Greedy algorithm

At each step, the best choice is made given what has already been selected

It does not always give the optimal solution


Example of lattice of views diagram

psc

pc ps sc

p s c

p: part, s: supp, c: cust


The lattice of views framework

If view V2 can be answered using the results of view V1, then

V2 is a descendant of V1, and V1 is an ancestor of V2

(denoted V2 ≼ V1)

E.g. (part) ≼ (part, cust)
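The ≼ relation can be sketched in a few lines of Python. This is only an illustration under one assumption: each view is identified by its set of group-by attributes (no concept hierarchies), so "can be answered from" reduces to a subset test. The function name `is_descendant` is made up for this sketch.

```python
# ASSUMPTION: a view is fully described by its set of group-by attributes,
# as in the part/supp/cust lattice. Concept hierarchies are not modelled.
def is_descendant(v2: frozenset, v1: frozenset) -> bool:
    """Return True if v2 ≼ v1, i.e. v2 can be answered from the results of v1."""
    return v2 <= v1


part = frozenset({"part"})
supp = frozenset({"supp"})
part_cust = frozenset({"part", "cust"})

print(is_descendant(part, part_cust))  # True: (part) ≼ (part, cust)
print(is_descendant(supp, part_cust))  # False: supp is not grouped in (part, cust)
```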


Some Definitions

k is the number of views to be materialized

C(v) is the cost of view v

C(S) is the total cost of answering every view from its cheapest materialized ancestor in S

Given

v is a view

S is a set of views which are already selected to be materialized

the benefit of selecting v for materialization is

B(v, S) = C(S) - C(S ∪ {v})


Greedy Algorithm

S ← {top view};

For i = 1 to k do

    Select the view v not in S such that B(v, S) is maximized;

    S ← S ∪ {v};

Return S
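The loop above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: it assumes view names are strings of single-letter attributes (so ancestry is a subset test), takes the view sizes from the part/supp/cust example in these slides, and uses the helper names `answers`, `total_cost`, `benefit` and `greedy`, which are invented here.

```python
# ASSUMPTION: view sizes in millions of rows, taken from the part/supp/cust
# example in these slides; each letter of a view name is one attribute.
SIZE = {"psc": 6.0, "pc": 6.0, "ps": 0.8, "sc": 6.0,
        "p": 0.2, "s": 0.01, "c": 0.1}


def answers(v: str, w: str) -> bool:
    """True if view w can be answered from view v (w ≼ v)."""
    return set(w) <= set(v)


def total_cost(S: set) -> float:
    """C(S): every view is answered from its cheapest materialized ancestor."""
    return sum(min(SIZE[v] for v in S if answers(v, w)) for w in SIZE)


def benefit(v: str, S: set) -> float:
    """B(v, S) = C(S) - C(S ∪ {v})."""
    return total_cost(S) - total_cost(S | {v})


def greedy(k: int, top: str = "psc") -> list:
    """Pick k views, each time the one with the largest benefit."""
    S, chosen = {top}, []
    for _ in range(k):
        best = max((v for v in SIZE if v not in S), key=lambda v: benefit(v, S))
        S.add(best)
        chosen.append(best)
    return chosen


print(greedy(2))  # ['ps', 'c']
```

Recomputing C(S) from scratch for every candidate keeps the sketch short; on a large lattice one would update the per-view costs incrementally instead.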

Example: greedy selection on the part/supp/cust data cube (k = 2)

Data cube with view sizes (number of rows):

psc 6M

pc 6M    ps 0.8M    sc 6M

p 0.2M    s 0.01M    c 0.1M

Initially S = {psc}. Each benefit below is (saving per view) x (number of views that become cheaper), in millions of rows (M).

1st Choice (M): every view is currently answered from psc (6M)

Benefit from pc = 6M - 6M = 0            0 x 3 = 0
Benefit from ps = 6M - 0.8M = 5.2M       5.2 x 3 = 15.6
Benefit from sc = 6M - 6M = 0            0 x 3 = 0
Benefit from p = 6M - 0.2M = 5.8M        5.8 x 1 = 5.8
Benefit from s = 6M - 0.01M = 5.99M      5.99 x 1 = 5.99
Benefit from c = 6M - 0.1M = 5.9M        5.9 x 1 = 5.9

ps has the largest benefit and is selected.

2nd Choice (M): ps, p and s are now answered from ps (0.8M)

Benefit from pc = 6M - 6M = 0            0 x 2 = 0
Benefit from sc = 6M - 6M = 0            0 x 2 = 0
Benefit from p = 0.8M - 0.2M = 0.6M      0.6 x 1 = 0.6
Benefit from s = 0.8M - 0.01M = 0.79M    0.79 x 1 = 0.79
Benefit from c = 6M - 0.1M = 5.9M        5.9 x 1 = 5.9

c has the largest benefit and is selected.

Two views to be materialized are

1. ps
2. c

V = {ps, c}

Gain(V U {top view}, {top view}) = 15.6 + 5.9 = 21.5

2nd Example of greedy algorithm

Initially, S = {a}; k = 4 (select 3 more)

Lattice of views with costs (a is the top view):

a 100

b 50    c 75

d 20    e 30    f 40

g 1    h 10

First choice

b: 50 x 5 = 250
c: 25 x 5 = 125
d: 80 x 2 = 160
e: 70 x 3 = 210
f: 60 x 2 = 120
g: 99 x 1 = 99
h: 90 x 1 = 90

b has the largest benefit and is selected.

Second choice

c: 25 x 2 = 50
d: 30 x 2 = 60
e: 20 x 3 = 60
f: (100-40) x 1 + (50-40) x 1 = 60 + 10 = 70
g: 49 x 1 = 49
h: 40 x 1 = 40

f has the largest benefit and is selected.

Third choice

c: 25 x 1 = 25
d: 30 x 2 = 60
e: (50-30) x 2 + (40-30) x 1 = 20 x 2 + 10 x 1 = 50
g: 49 x 1 = 49
h: 30 x 1 = 30

d has the largest benefit and is selected.

If we materialize only a, the cost would be 8 x 100 = 800

Now, with S = {a, b, f, d}, the cost is 800 - 250 - 70 - 60 = 420
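The totals 800 and 420 can be checked with a short Python sketch. The costs come from the slides, but the lattice edges are an assumption: the diagram did not survive, so the parent-to-child edges below are inferred from the benefit counts (for example, b improves 5 views and d improves 2). The names `CHILDREN`, `descendants` and `total_cost` are made up for this sketch.

```python
# View costs from the slides.
COST = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30, "f": 40, "g": 1, "h": 10}

# ASSUMPTION: parent -> children edges, inferred from the benefit counts on
# the slides (the original lattice diagram is not available in the text).
CHILDREN = {"a": "bc", "b": "de", "c": "ef", "d": "g", "e": "gh",
            "f": "h", "g": "", "h": ""}


def descendants(v: str) -> set:
    """All views that can be answered from v, including v itself."""
    result = {v}
    for child in CHILDREN[v]:
        result |= descendants(child)
    return result


def total_cost(S) -> int:
    """Each view is answered from its cheapest materialized ancestor in S."""
    return sum(min(COST[v] for v in S if w in descendants(v)) for w in COST)


print(total_cost({"a"}))                 # 800
print(total_cost({"a", "b", "f", "d"}))  # 420
```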


Performance Study

How badly can the Greedy Algorithm perform?

Example: a data cube on which Greedy is not optimal (k = 2)

Data cube with view costs:

a 200

b 100    c 99    d 100

p1 ... p20    q1 ... q20    r1 ... r20    s1 ... s20    (4 groups of 20 nodes, each node 97)

1st Choice

Benefit from b = 200 - 100 = 100    41 x 100 = 4100
Benefit from c = 200 - 99 = 101     41 x 101 = 4141
Benefit from d = 200 - 100 = 100    41 x 100 = 4100

c has the largest benefit, so Greedy selects it first.

2nd Choice (after c)

Benefit from b = 200 - 100 = 100    21 x 100 = 2100
Benefit from d = 200 - 100 = 100    21 x 100 = 2100

Greedy: V = {b, c}

Gain(V U {top view}, {top view}) = 4141 + 2100 = 6241

If b is selected first instead, the 2nd-choice benefits are

c: 21 x 101 + 20 x 1 = 2141
d: 41 x 100 = 4100

Optimal: V = {b, d}

Gain(V U {top view}, {top view}) = 4100 + 4100 = 8200

Greedy / Optimal = 6241 / 8200 = 0.7611

If this ratio = 1, Greedy gives an optimal solution.

If this ratio is close to 0, Greedy may give a “bad” solution.

Does this ratio have a lower bound?

It is proved that this ratio is at least 0.63.
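The ratio can be reproduced with a brute-force check in Python. This is a sketch under a stated assumption: the lattice shape is inferred from the benefit counts on the slides (b covers the p and q nodes, c covers q and r, d covers r and s). The names `COVERS`, `can_answer`, `total_cost` and `gain` are invented here. The last line prints 1 - 1/e, the known lower bound for this greedy algorithm, which matches the 0.63 quoted above.

```python
import math
from itertools import combinations

# ASSUMPTION: lattice shape inferred from the slides' benefit counts:
# b covers p1..p20 and q1..q20, c covers q and r, d covers r and s.
COST = {"a": 200, "b": 100, "c": 99, "d": 100}
COVERS = {"b": "pq", "c": "qr", "d": "rs"}
LEAVES = [f"{group}{i}" for group in "pqrs" for i in range(1, 21)]
COST.update({leaf: 97 for leaf in LEAVES})


def can_answer(v: str, w: str) -> bool:
    """True if view w can be computed from view v."""
    if v == w or v == "a":
        return True
    return v in COVERS and w in LEAVES and w[0] in COVERS[v]


def total_cost(S) -> int:
    """Each view is answered from its cheapest materialized ancestor in S."""
    return sum(min(COST[v] for v in S if can_answer(v, w)) for w in COST)


def gain(V) -> int:
    """Gain(V U {top view}, {top view})."""
    return total_cost({"a"}) - total_cost({"a"} | set(V))


greedy_gain = gain({"b", "c"})  # the pair that Greedy ends up with
optimal_gain = max(gain(pair) for pair in combinations(COST.keys() - {"a"}, 2))

print(greedy_gain, optimal_gain, round(greedy_gain / optimal_gain, 4))
# 6241 8200 0.7611
print(round(1 - 1 / math.e, 2))  # 0.63
```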


Indexing OLAP Data: Bitmap Index

Relation table

Cust   Region    Type
C1     Asia      Retail
C2     Europe    Dealer
C3     Asia      Dealer
C4     America   Retail
C5     Europe    Dealer

Index on Region

RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0

Index on Type

RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1
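A bitmap index of this kind can be sketched in Python. The example is illustrative only: it stores each bit vector as a plain list of 0/1 values, and the function name `build_bitmap_index` is made up. It shows why such indexes suit OLAP selections: a conjunctive condition becomes a bitwise AND of two vectors.

```python
# Base table from the slide: (Cust, Region, Type).
rows = [("C1", "Asia", "Retail"), ("C2", "Europe", "Dealer"),
        ("C3", "Asia", "Dealer"), ("C4", "America", "Retail"),
        ("C5", "Europe", "Dealer")]


def build_bitmap_index(rows, column: int) -> dict:
    """Map each distinct value of a column to one bit per record."""
    index = {}
    for rec_id, row in enumerate(rows):
        bits = index.setdefault(row[column], [0] * len(rows))
        bits[rec_id] = 1
    return index


region_index = build_bitmap_index(rows, 1)
type_index = build_bitmap_index(rows, 2)

# Region = 'Asia' AND Type = 'Dealer' is a bitwise AND of two bit vectors.
hits = [a & b for a, b in zip(region_index["Asia"], type_index["Dealer"])]

print(region_index["Asia"])   # [1, 0, 1, 0, 0]
print(type_index["Dealer"])   # [0, 1, 1, 0, 1]
print([rows[i][0] for i, bit in enumerate(hits) if bit])  # ['C3']
```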


Determining which materialized cuboid(s) should be selected for OLAP operations

Query: Find the total sales, grouped by {product-category, province}, with the condition “year = 2004”.

Which one of the following 4 materialized cuboids should be selected to process the query?

1) {year, product, city}

2) {year, product-category, country}

3) {year, product-category, province}

4) {product, province} where year = 2004


Solution:

Let the query to be processed be on {product-category, province} with the condition “year = 2004”, and let there be 4 materialized cuboids available:

1) {year, product, city}

- It can be used. However, it costs the most, because product and city are at a lower (finer) level than product-category and province.

2) {year, product-category, country}

- It cannot be used, because country is a more general concept than province.

3) {year, product-category, province}

- It can be used. It could cost less than cuboid 4 if there are not many year values and there are many products for each product-category.

4) {product, province} where year = 2004

- It can be used.
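The "can it be used" test can be sketched in Python. This is a simplified illustration, not a cost model: it assumes three concept hierarchies listed from finest to coarsest level, treats a cuboid as usable when every level the query needs is available at the same or a finer level, and lets a cuboid skip the year dimension only if it was already filtered on exactly the query's condition. The names `HIERARCHY`, `level` and `can_answer` are made up for this sketch.

```python
# ASSUMPTION: concept hierarchies, finest level first.
HIERARCHY = {
    "product": ["product", "product-category"],
    "location": ["city", "province", "country"],
    "time": ["year"],
}

QUERY = {"product": "product-category", "location": "province"}
CONDITION = ("time", "year", 2004)

# Each cuboid: (dimension -> level, filter already applied to it or None).
CUBOIDS = {
    1: ({"time": "year", "product": "product", "location": "city"}, None),
    2: ({"time": "year", "product": "product-category", "location": "country"}, None),
    3: ({"time": "year", "product": "product-category", "location": "province"}, None),
    4: ({"product": "product", "location": "province"}, CONDITION),
}


def level(dim: str, name: str) -> int:
    """Position of a level in its hierarchy (0 = finest)."""
    return HIERARCHY[dim].index(name)


def can_answer(levels: dict, applied_filter) -> bool:
    needed = dict(QUERY)
    if applied_filter != CONDITION:
        # The cuboid must still expose the level used in the condition.
        needed[CONDITION[0]] = CONDITION[1]
    return all(dim in levels and level(dim, levels[dim]) <= level(dim, lvl)
               for dim, lvl in needed.items())


usable = [n for n, (levels, flt) in CUBOIDS.items() if can_answer(levels, flt)]
print(usable)  # [1, 3, 4]
```

Choosing among the usable cuboids still requires a size estimate for each one, which this sketch does not attempt.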


Iceberg queries

Assume we want to find pairs of customers and items such that the customer has purchased the item more than 5 times

select P.custid, P.item, sum(P.qty)
from Purchases P
group by P.custid, P.item
having sum(P.qty) > 5

Execution plan for the query?

The number of groups is very large, but the answer to the query (the tip of the iceberg) is usually very small


Iceberg queries

select P.custid, P.item, sum(P.qty)
from Purchases P
group by P.custid, P.item
having sum(P.qty) > 5

Q1:

select P.custid
from Purchases P
group by P.custid
having sum(P.qty) > 5

Q2:

select P.item
from Purchases P
group by P.item
having sum(P.qty) > 5

Generate (custid, item) pairs only for custid from Q1 and item from Q2

With non-negative quantities, a pair can exceed the threshold only if both its customer total and its item total do.
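The two-pass pruning idea can be sketched in Python. This is an illustration on hypothetical data, assuming quantities are non-negative: the first pass computes the customer totals (Q1) and item totals (Q2), and the second pass keeps counters only for pairs whose customer and item both passed.

```python
from collections import Counter

# Hypothetical Purchases rows: (custid, item, qty). qty is assumed >= 0.
purchases = [("c1", "pen", 4), ("c1", "pen", 3), ("c1", "ink", 1),
             ("c2", "pen", 2), ("c2", "cup", 1), ("c3", "ink", 6)]
THRESHOLD = 5

# First pass: Q1 (per customer) and Q2 (per item).
cust_total, item_total = Counter(), Counter()
for cust, item, qty in purchases:
    cust_total[cust] += qty
    item_total[item] += qty
q1 = {c for c, total in cust_total.items() if total > THRESHOLD}
q2 = {i for i, total in item_total.items() if total > THRESHOLD}

# Second pass: aggregate only the candidate (custid, item) pairs.
pair_total = Counter()
for cust, item, qty in purchases:
    if cust in q1 and item in q2:
        pair_total[(cust, item)] += qty

answer = {pair: total for pair, total in pair_total.items() if total > THRESHOLD}
print(answer)  # {('c1', 'pen'): 7, ('c3', 'ink'): 6}
```

Only 3 pair counters are created here instead of one per distinct pair, which is the point of the pruning when the number of groups is very large.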


From On-Line Analytical Processing (OLAP) to On-Line Analytical Mining (OLAM)

Why online analytical mining?

High quality of data in data warehouses

OLAP-based exploratory data analysis

Easy selection of data mining functions

April 21, 2023

An OLAM System Architecture

Layer 4: User Interface
    Mining query (input), mining result (output)

User GUI API

Layer 3: OLAP/OLAM
    OLAM Engine, OLAP Engine

Data Cube API

Layer 2: MDDB
    MDDB, Meta Data

Database API (filtering and integration)

Layer 1: Data Repository
    Databases, Data Warehouse (data cleaning, data integration, filtering)
