Dr. Panagiotis Symeonidis
Data Engineering Laboratory
http://delab.csd.auth.gr/~symeon
Data Warehouse implementation: Part B
Cuboids Materialization as an Optimization Problem
Minimize: the average time taken to evaluate a view
Constraint: materialize a fixed number k of views
Greedy algorithm
  Each choice is the best one given what has been selected before
  It does not always give the optimal solution
Example of lattice of views diagram
          psc
   pc     ps     sc
   p      s      c
p: part, s: supplier (supp), c: customer (cust)
The lattice of views framework
If view V2 can be answered using the results of view V1, then
  V2 is a descendant of V1
  V1 is an ancestor of V2
  (denoted V2 ≼ V1)
E.g. (part) ≼ (part, cust)
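The ordering above can be sketched in a few lines of Python. This is a minimal sketch, assuming plain group-by views with no dimension hierarchies, so that V2 ≼ V1 holds exactly when the grouping attributes of V2 are a subset of those of V1. The names `cuboids` and `is_descendant` are illustrative and do not come from the slides.

```python
from itertools import combinations

def cuboids(dimensions):
    """All group-by sets over the dimensions, from the top view down to the empty one."""
    dims = list(dimensions)
    return [frozenset(c)
            for r in range(len(dims), -1, -1)
            for c in combinations(dims, r)]

def is_descendant(v2, v1):
    """True if v2 ≼ v1, i.e. v2 can be answered using the results of v1."""
    return frozenset(v2) <= frozenset(v1)

views = cuboids(["part", "supp", "cust"])
print(len(views))                                   # number of views in the lattice
print(is_descendant({"part"}, {"part", "cust"}))    # (part) ≼ (part, cust)
print(is_descendant({"part", "cust"}, {"part"}))    # the reverse does not hold
```

Note that `cuboids` also generates the empty group-by, which the lattice diagram on the slide does not show.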
Some Definitions
k is the number of views to be materialized
C(v) is the cost of view v
C(S) is the total cost of answering every view from its cheapest materialized ancestor in S
Given
  v is a view
  S is a set of views which are already selected to be materialized
The benefit of selecting v for materialization is
  B(v, S) = C(S) – C(S U {v})
Greedy Algorithm
S ← {top view};
For i = 1 to k do
  Select the view v not in S such that B(v, S) is maximized;
  S ← S U {v};
Return S
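The pseudocode above can be sketched in Python as follows. This is a minimal sketch, not the lecturer's implementation: it assumes that the cost of answering a view equals the size of its cheapest materialized ancestor, and that B(v, S) is the resulting drop in total cost. The function name `greedy_select` and the helper `answers` are hypothetical; the sizes are those of the part/supp/cust data cube used in these slides.

```python
def greedy_select(sizes, answers, top, k):
    """Pick k views greedily, starting from the top view.

    sizes:   dict mapping each view to its cost C(v) (here: row count)
    answers: answers(v, w) is True when w can be answered from v (w ≼ v)
    """
    def total_cost(selected):
        # C(S): every view is answered from its cheapest materialized ancestor.
        return sum(
            min(sizes[s] for s in selected if answers(s, w))
            for w in sizes
        )

    selected = [top]
    steps = []  # (chosen view, benefit B(v, S)) for each round
    for _ in range(k):
        base = total_cost(selected)
        candidates = [v for v in sizes if v not in selected]
        benefit = {v: base - total_cost(selected + [v]) for v in candidates}
        best = max(benefit, key=benefit.get)
        steps.append((best, benefit[best]))
        selected.append(best)
    return selected, steps

# Sizes in millions of rows, taken from the data cube example in the slides.
sizes = {"psc": 6.0, "pc": 6.0, "ps": 0.8, "sc": 6.0,
         "p": 0.2, "s": 0.01, "c": 0.1}

def answers(v, w):
    # Assumption: a view answers another when it groups by a superset of its attributes.
    return set(w) <= set(v)

selected, steps = greedy_select(sizes, answers, "psc", 2)
for view, benefit in steps:
    print(view, round(benefit, 2))
```

With k = 2 the sketch selects ps and then c, with benefits of about 15.6 and 5.9, which matches the worked example in the slides.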
1.1 Data Cube
k = 2

Lattice of views with their sizes:
              psc 6M
   pc 6M      ps 0.8M     sc 6M
   p 0.2M     s 0.01M     c 0.1M

Benefit (saving per view x number of views that benefit):

View   1st Choice (M)       2nd Choice (M)
pc     0 x 3 = 0            0 x 2 = 0
ps     5.2 x 3 = 15.6       -
sc     0 x 3 = 0            0 x 2 = 0
p      5.8 x 1 = 5.8        0.6 x 1 = 0.6
s      5.99 x 1 = 5.99      0.79 x 1 = 0.79
c      5.9 x 1 = 5.9        5.9 x 1 = 5.9

1st choice (only the top view is materialized):
  Benefit from pc = 6M-6M = 0
  Benefit from ps = 6M-0.8M = 5.2M
  Benefit from sc = 6M-6M = 0
  Benefit from p = 6M-0.2M = 5.8M
  Benefit from s = 6M-0.01M = 5.99M
  Benefit from c = 6M-0.1M = 5.9M

2nd choice (after ps has been selected):
  Benefit from pc = 6M-6M = 0
  Benefit from sc = 6M-6M = 0
  Benefit from p = 0.8M-0.2M = 0.6M
  Benefit from s = 0.8M-0.01M = 0.79M
  Benefit from c = 6M-0.1M = 5.9M

Two views to be materialized are
  1. ps
  2. c
V = {ps, c}
Gain(V U {top view}, {top view}) = 15.6 + 5.9 = 21.5
2nd Example of greedy algorithm
Initially, S = {a}
k = 4 (select 3 more)
Lattice of views (cost of each view in parentheses):
            a (100)
      b (50)      c (75)
  d (20)    e (30)    f (40)
      g (1)       h (10)
2nd Example of greedy algorithm
First choice
  b: 50 x 5 = 250
  c: 25 x 5 = 125
  d: 80 x 2 = 160
  e: 70 x 3 = 210
  f: 60 x 2 = 120
  g: 99 x 1 = 99
  h: 90 x 1 = 90
Selected: b (largest benefit)
2nd Example of greedy algorithm
Second choice
  c: 25 x 2 = 50
  d: 30 x 2 = 60
  e: 20 x 3 = 60
  f: (100-40) x 1 + (50-40) x 1 = 60 + 10 = 70
  g: 49 x 1 = 49
  h: 40 x 1 = 40
Selected: f (largest benefit)
2nd Example of greedy algorithm
Third choice
  c: 25 x 1 = 25
  d: 30 x 2 = 60
  e: (50-30) x 2 + (40-30) x 1 = 20 x 2 + 10 x 1 = 50
  g: 49 x 1 = 49
  h: 30 x 1 = 30
Selected: d (largest benefit)
2nd Example of greedy algorithm
If we materialize only a, then the total cost is 8 x 100 = 800
After also materializing b, f and d, the total cost is 800 - 250 - 70 - 60 = 420
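The two totals can be checked directly with a short Python sketch. The view costs are the ones on the slides. The `parents` edges are an assumption: the extracted text does not preserve the arrows of the diagram, so the edges below were chosen to agree with the number of views that benefit in each row of the example (for instance, 5 views for b and 3 for e). The names `ancestors` and `total_cost` are illustrative.

```python
# View costs from the slides.
cost = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30, "f": 40, "g": 1, "h": 10}

# ASSUMPTION: lattice edges (child -> parents), reconstructed from the benefit counts.
parents = {
    "a": [], "b": ["a"], "c": ["a"],
    "d": ["b"], "e": ["b", "c"], "f": ["c"],
    "g": ["d", "e"], "h": ["e", "f"],
}

def ancestors(view):
    """The view itself plus every view it can be answered from."""
    found = {view}
    stack = [view]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

def total_cost(materialized):
    """Each view is answered from its cheapest materialized ancestor."""
    return sum(
        min(cost[a] for a in ancestors(v) if a in materialized)
        for v in cost
    )

print(total_cost({"a"}))                  # only the top view
print(total_cost({"a", "b", "f", "d"}))   # after the three greedy choices
```

Under these assumed edges the two calls give 800 and 420, the same totals as on the slide.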
Performance Study
How badly can the Greedy Algorithm perform?
1.1 Data Cube
k = 2

Lattice of views with their costs:
  a 200
  b 100      c 99      d 100
  p1 97 … p20 97    q1 97 … q20 97    r1 97 … r20 97    s1 97 … s20 97
  (four groups of 20 nodes)

Benefit from b = 200-100 = 100
Benefit from c = 200-99 = 101

View   1st Choice (M)       2nd Choice (M), after c    2nd Choice (M), after b
b      41 x 100 = 4100      21 x 100 = 2100            -
c      41 x 101 = 4141      -                          21 x 101 + 20 x 1 = 2141
d      41 x 100 = 4100      21 x 100 = 2100            41 x 100 = 4100
…      …                    …                          …

Greedy: V = {b, c}
  Gain(V U {top view}, {top view}) = 4141 + 2100 = 6241
Optimal: V = {b, d}
  Gain(V U {top view}, {top view}) = 4100 + 4100 = 8200
1.1 Data Cube
k = 2
Greedy / Optimal = 6241 / 8200 = 0.7611
If this ratio = 1, Greedy gives an optimal solution.
If this ratio is close to 0, Greedy may give a “bad” solution.
Does this ratio have a lower bound?
It is proved that this ratio is at least 0.63.
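The ratio can be reproduced by brute force on the lattice of this example. The sketch below is illustrative only. The costs come from the slides, but the placement of the four leaf groups is an assumption inferred from the counts (41 views benefit from each of b, c and d): p under b, q under b and c, r under c and d, s under d. The names `covers`, `total_cost` and `gain` are hypothetical.

```python
from itertools import combinations

cost = {"a": 200, "b": 100, "c": 99, "d": 100}
# covers[v]: the set of views that can be answered from v.
covers = {v: {v} for v in cost}

# ASSUMPTION: which leaf groups sit under which middle view.
groups = {"p": ["b"], "q": ["b", "c"], "r": ["c", "d"], "s": ["d"]}
for name, owners in groups.items():
    for i in range(1, 21):
        leaf = f"{name}{i}"
        cost[leaf] = 97
        covers[leaf] = {leaf}
        for owner in owners:
            covers[owner].add(leaf)
covers["a"] = set(cost)  # the top view answers everything

def total_cost(materialized):
    """Each view is answered from its cheapest materialized ancestor."""
    return sum(
        min(cost[m] for m in materialized if v in covers[m])
        for v in cost
    )

def gain(extra):
    """Gain of materializing `extra` on top of the top view alone."""
    return total_cost({"a"}) - total_cost({"a", *extra})

greedy_gain = gain({"b", "c"})  # the set chosen by the greedy algorithm
candidates = sorted(v for v in cost if v != "a")
best = max(combinations(candidates, 2), key=gain)  # exhaustive search, k = 2

print(greedy_gain, best, gain(best))
print(round(greedy_gain / gain(best), 4))
```

Under these assumptions the exhaustive search finds {b, d} with a gain of 8200, against 6241 for the greedy choice. The 0.63 bound on the slide is presumably the usual 1 - 1/e ≈ 0.632 guarantee for this kind of greedy selection; the slides do not state its derivation.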
Indexing OLAP Data: Bitmap Index

Relation table
Cust   Region    Type
C1     Asia      Retail
C2     Europe    Dealer
C3     Asia      Dealer
C4     America   Retail
C5     Europe    Dealer

Index on Region
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0

Index on Type
RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1
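The tables above can be rebuilt with a small Python sketch, which also shows why bitmap indexes suit OLAP selections: a conjunctive condition becomes a bitwise AND of two bit vectors. The function name `build_bitmap_index` and the example condition are illustrative and not taken from the slides.

```python
# Relation table from the slide: (Cust, Region, Type); RecID is the row position + 1.
rows = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]

def build_bitmap_index(rows, column):
    """Map each distinct value of a column to a bit vector with one bit per record."""
    index = {}
    for position, row in enumerate(rows):
        bits = index.setdefault(row[column], [0] * len(rows))
        bits[position] = 1
    return index

region_index = build_bitmap_index(rows, 1)
type_index = build_bitmap_index(rows, 2)

# Example condition: Region = 'Europe' AND Type = 'Dealer', evaluated bit by bit.
match = [r & t for r, t in zip(region_index["Europe"], type_index["Dealer"])]
customers = [rows[i][0] for i, bit in enumerate(match) if bit]

print(region_index["Asia"])
print(type_index["Retail"])
print(customers)
```

Each printed vector is one column of the corresponding index table, read from RecID 1 to RecID 5.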
Determining which materialized cuboid(s) should be selected for OLAP operations
Query: Find the total sales grouped by {product-category, province} with the condition “year = 2004”.
Which one of the following 4 materialized cuboids should be selected to process the query?
1) {year, product, city}
2) {year, product-category, country}
3) {year, product-category, province}
4) {product, province} where year = 2004
Solution:
Let the query to be processed be on {product-category, province} with the condition “year = 2004”, and let there be 4 materialized cuboids available:
1) {year, product, city}
   - it can be used. However, it costs the most because product and city are at lower (finer) levels
2) {year, product-category, country}
   - it cannot be used because country is a more general concept than province
3) {year, product-category, province}
   - it can be used. It could cost less than cuboid 4 if there are not many year values and there are many products for each product-category
4) {product, province} where year = 2004
   - it can be used
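The usability test applied in this solution can be sketched in Python. This is a sketch under stated assumptions: the concept hierarchies (product below product-category; city below province below country) are implied by the wording of the slide rather than given explicitly, and the dictionary layout, `level` and `can_answer` are hypothetical names. The sketch only decides whether a cuboid can be used; it does not estimate the cost.

```python
# ASSUMPTION: concept hierarchies, finest level first.
HIERARCHIES = {
    "time": ["year"],
    "product": ["product", "product-category"],
    "location": ["city", "province", "country"],
}

def level(dim, name):
    """Position of a level in its hierarchy (0 = finest)."""
    return HIERARCHIES[dim].index(name)

def can_answer(cuboid, query):
    """A cuboid is usable if every dimension the query needs is kept at the same
    or a finer level, or the cuboid was already sliced on exactly that condition."""
    levels = cuboid["levels"]
    for dim, needed in query["group_by"].items():
        if dim not in levels or level(dim, levels[dim]) > level(dim, needed):
            return False
    for dim, (needed, value) in query["where"].items():
        kept = dim in levels and level(dim, levels[dim]) <= level(dim, needed)
        pre_sliced = cuboid.get("slice", {}).get(dim) == (needed, value)
        if not (kept or pre_sliced):
            return False
    return True

query = {
    "group_by": {"product": "product-category", "location": "province"},
    "where": {"time": ("year", 2004)},
}
cuboids = {
    1: {"levels": {"time": "year", "product": "product", "location": "city"}},
    2: {"levels": {"time": "year", "product": "product-category", "location": "country"}},
    3: {"levels": {"time": "year", "product": "product-category", "location": "province"}},
    4: {"levels": {"product": "product", "location": "province"},
        "slice": {"time": ("year", 2004)}},
}

usable = [number for number, cuboid in cuboids.items() if can_answer(cuboid, query)]
print(usable)
```

Cuboid 2 is rejected because country is coarser than province, which agrees with the solution above.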
Iceberg queries
Assume we want to find pairs of customers and items such that the customer has purchased the item more than 5 times

select P.custid, P.item, sum(P.qty)
from Purchases P
group by P.custid, P.item
having sum(P.qty) > 5

Execution plan for the query?
The number of groups is very large, but the answer to the query (the tip of the iceberg) is usually very small
Iceberg queries
Original query:
select P.custid, P.item, sum(P.qty)
from Purchases P
group by P.custid, P.item
having sum(P.qty) > 5

Q1:
select P.custid
from Purchases P
group by P.custid
having sum(P.qty) > 5

Q2:
select P.item
from Purchases P
group by P.item
having sum(P.qty) > 5

Generate (custid, item) pairs only for custid from Q1 and item from Q2
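The pruning idea behind Q1 and Q2 can be sketched in Python. This is a minimal in-memory sketch, not an execution plan from the slides: the sample `purchases` data and the function name `iceberg_pairs` are invented for illustration, and the pruning is only valid under the assumption that quantities are never negative (a pair cannot exceed the threshold unless both its customer and its item do).

```python
from collections import Counter

# Hypothetical sample data: (custid, item, qty).
purchases = [
    ("c1", "pen", 4), ("c1", "pen", 3), ("c1", "ink", 1),
    ("c2", "pen", 2), ("c2", "ink", 2), ("c3", "cap", 1),
]

def iceberg_pairs(purchases, threshold):
    """Return {(custid, item): total qty} for pairs whose total exceeds the threshold."""
    by_customer, by_item = Counter(), Counter()
    for custid, item, qty in purchases:
        by_customer[custid] += qty
        by_item[item] += qty

    # Q1 and Q2: customers and items that pass the threshold on their own.
    good_customers = {c for c, total in by_customer.items() if total > threshold}
    good_items = {i for i, total in by_item.items() if total > threshold}

    # Only pairs built from Q1 customers and Q2 items are aggregated.
    pair_totals = Counter()
    for custid, item, qty in purchases:
        if custid in good_customers and item in good_items:
            pair_totals[(custid, item)] += qty
    return {pair: total for pair, total in pair_totals.items() if total > threshold}

print(iceberg_pairs(purchases, 5))
```

The two cheap passes keep the number of (custid, item) groups held in memory small, which is the point of evaluating Q1 and Q2 first.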
From On-Line Analytical Processing (OLAP) to On-Line Analytical Mining (OLAM)
Why online analytical mining?
High quality of data in data warehouses
OLAP-based exploratory data analysis
Easy selection of data mining functions
April 21, 2023
An OLAM System Architecture
[Figure: layered OLAM system architecture. Labels recovered from the diagram:]
Layer 4, User Interface: mining query, mining result
User GUI API
Layer 3, OLAP/OLAM: OLAM Engine, OLAP Engine
Data Cube API
Layer 2, MDDB: MDDB, Meta Data
Database API
Layer 1, Data Repository: Databases, Data Warehouse
Other labels: Filtering & Integration, Filtering, Data cleaning, Data integration