Implementing Data Cubes Efficiently

Vicky :: Cao Hui Ping
Sherman :: Chow Sze Ming
CTH :: Chong Tsz Ho
Ronald :: Woo Lok Yan
Ken :: Yiu Man Lung

Content

Background
Introduction of datacube
Problem defined
Lattice model
Greedy algorithm
  How to do?
  How good?
  How bad?
Evaluations
Conclusion

Background

DSS (Decision Support System)
  Gain competitiveness for business
Data warehouse
  Maintain historical information
  Use “data cube” to summarize results
  Identify trends
  Performance issue (time and space)
  Need to reuse results (materialization of views)

Introduction of datacube

Datacube
  Dimensionality (number of GROUP-BYs)
  Aggregated data: values in each cell
  Dimension of datacube determines the detail of the summary
  Higher dimension means higher detail
Common operations
  Drill down: look in more detail
  Roll up: look in less detail

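What is a data cube?

[Figure: a sales data cube with three dimensions: Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, VCR, PC) and Country (U.S.A., Canada, Mexico), plus a "sum" slice along each dimension. The highlighted cell is the total annual sales of TV in U.S.A.]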
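To make the picture concrete, here is a minimal sketch that computes every GROUP-BY of a tiny fact table. The rows, the sales figures and the name `data_cube` are made up for illustration; only the three dimensions come from the slide.

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical fact rows: (date, product, country, sales). Values are invented.
ROWS = [
    ("1Qtr", "TV", "U.S.A", 10),
    ("2Qtr", "TV", "U.S.A", 20),
    ("1Qtr", "PC", "Canada", 5),
    ("3Qtr", "TV", "Mexico", 7),
]
DIMS = ("date", "product", "country")

def data_cube(rows):
    """Return {grouping: {key: sum of sales}} for every subset of the dimensions."""
    cube = {}
    for size in range(len(DIMS) + 1):
        for idx in combinations(range(len(DIMS)), size):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in idx)  # project the row onto this grouping
                totals[key] += row[-1]            # aggregate the measure
            cube[tuple(DIMS[i] for i in idx)] = dict(totals)
    return cube

cube = data_cube(ROWS)
# "Total annual sales of TV in U.S.A." is one cell of the (product, country) grouping.
tv_usa = cube[("product", "country")][("TV", "U.S.A")]
print(len(cube), tv_usa)
```

With three dimensions the cube holds 2^3 = 8 groupings, from the full detail down to the single grand total.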

Our problem

Physically materialize the whole data cube
  Best query response
  Heavy pre-computing, large storage space
  i.e. time efficient but space inefficient
Materialize nothing
  Worst query response
  Dynamic query evaluation, less storage space
  i.e. space efficient but time inefficient

Problem on materialized views

Materialize only part of the data cube
  Balance the storage space and the response time
  What is the best subset to materialize?
  Addressed in this paper

Source             Size        Time (sec)   Ratio
From cell itself   1           2.07         N/A
View (s)           10,000      2.38         0.000031
View (p,s)         800,000     20.77        0.000023
View (p,s,c)       6,000,000   226.23       0.000037

Data? View?

We use a data cube to model aggregate data.

So what do we use to model views?

Lattice!

Example of lattice diagram

8 possible groupings on the dimensions
  p for Part
  s for Supplier
  c for Customer
Number of rows of data shown next to each grouping

psc 6M
pc 6M     ps 0.8M    sc 6M
p 0.2M    s 0.01M    c 0.1M
none 1

An example of a regular lattice

≼ operator

Suppose c ≼ d
  The view d can be used to derive the view c
  d is an ancestor of c in the lattice diagram
Impose a partial order on the views
Usage on dimensions
  (part) ≼ (part, customer)
  (part) ⋠ (customer)
Usage within attribute value
  (year) ≼ (quarter) ≼ (month) ≼ (day)
  (year) ≼ (quarter) ≼ (week) ≼ (day)

[Figure: lattice of the time hierarchy with nodes day, week, month, quarter and year.]

An example of an irregular lattice
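For the dimension case, the ≼ test reduces to a subset check on the grouping attributes. The sketch below assumes each view is written as a string of dimension letters (p, s, c), with the empty string standing for the "none" view; the function names are illustrative, not from the paper.

```python
from itertools import combinations

DIMENSIONS = "psc"  # part, supplier, customer

def all_groupings(dims=DIMENSIONS):
    """Every subset of the dimensions, largest first; "" is the 'none' view."""
    return ["".join(combo)
            for n in range(len(dims), -1, -1)
            for combo in combinations(dims, n)]

def precedes(v, w):
    """v ≼ w: view v can be derived from view w (v groups by a subset of w's attributes)."""
    return set(v) <= set(w)

def derivable_from(v, dims=DIMENSIONS):
    """All views that can be used to answer v, including v itself."""
    return [w for w in all_groupings(dims) if precedes(v, w)]

print(precedes("p", "pc"))   # (part) ≼ (part, customer)
print(precedes("p", "c"))    # (part) ⋠ (customer)
print(derivable_from("p"))
```

Hierarchies within one attribute (day, week, month, ...) need an explicit edge list instead of a subset test, which is what makes those lattices irregular.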

Regular lattices with equal domain size

Grouping attributes: A1, A2, …, An (domain size: r)
Attribute for aggregation: B
Efficient algorithm
  m: number of rows in the top view
  k = ⌈log_r m⌉

Strategy        k, j, and n           Space           Time
Space-optimal                         m               m·2^n
Time-optimal    k > j                 (2rr/(r+1))n    (2rr/(r+1))n
                k < j and k ≤ n/2     m               m·2^n
                k < j and k > n/2     m               C(n,j)·r^j

The problem

The previous technique cannot be applied to irregular lattices
Irregular lattices are common in data warehouses
Optimizing the choice of views for an irregular lattice is an NP-complete problem (inefficient!)
Use a greedy algorithm, i.e. use heuristics to obtain an approximate solution

Greedy algorithm

Being as greedy as possible in each step!!

Simple example: use the smallest number of coins to pay 50 cents

Suppose we have many coins of 20 cents, 10 cents and 5 cents.

How to be greedy?

Common sense approach:
  Select the largest coin: 20 cents
  Select the largest coin again: 20 cents
  Remaining amount = 50 – 20 – 20 = 10 cents
  We cannot select the largest coin again.
  We choose the 2nd largest coin, 10 cents, instead.

Only 3 coins are needed! Optimal solution!
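The coin example can be sketched in a few lines. The function name `greedy_change` is hypothetical; the coin values are the ones from the slide.

```python
def greedy_change(amount, coins=(20, 10, 5)):
    """Pay `amount` cents by repeatedly taking the largest coin that still fits."""
    picked = []
    for coin in sorted(coins, reverse=True):
        while amount >= coin:
            picked.append(coin)
            amount -= coin
    if amount != 0:
        raise ValueError("amount cannot be paid exactly with these coins")
    return picked

print(greedy_change(50))  # [20, 20, 10]
```

For these coin values the greedy choice happens to be optimal; that is not true for every coin system, which mirrors the later point that greedy view selection is not always optimal either.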

Definition of “benefit of view”

C(v) denotes the cost of view v
B(v, S) denotes the benefit of a view v relative to a set of views S
For each w ≼ v:
  Let u be the view of least cost in S such that w ≼ u
  B_w = max{C(u) – C(v), 0}
B(v, S) = Σ B_w, summed over all w ≼ v
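The definition translates almost directly into code. The sketch below uses the costs of the eight-view example that appears later in these slides; the `BELOW` table (which views each view can answer) is an assumed reading of that figure, chosen so that it matches the benefit counts worked out on the slides.

```python
# Costs taken from the example lattice later in the slides.
COST = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30, "f": 40, "g": 1, "h": 10}

# BELOW[v] = every view w with w ≼ v, including v itself (assumed lattice shape).
BELOW = {
    "a": set("abcdefgh"),
    "b": set("bdegh"),
    "c": set("cefgh"),
    "d": set("dg"),
    "e": set("egh"),
    "f": set("fh"),
    "g": set("g"),
    "h": set("h"),
}

def benefit(v, selected):
    """B(v, S): total saving over all w ≼ v if v is materialized in addition to S."""
    total = 0
    for w in BELOW[v]:
        # u is the cheapest already-selected view that can answer w.
        cheapest = min(COST[u] for u in selected if w in BELOW[u])
        total += max(cheapest - COST[v], 0)
    return total

print(benefit("b", {"a"}))       # first round
print(benefit("f", {"a", "b"}))  # second round, after b was chosen
```

The set S must contain the top view, otherwise some w would have no view to be answered from and `min` would fail.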

Greedy algorithm

In each step
  Select the view with the most benefit
  Add it to the result

Algorithm
  S = {top view};
  for i = 1 to k {
    select view v not in S such that B(v,S) is maximized;
    S = S union {v};
  }
  return S;
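A runnable sketch of the loop above, written generically over a cost table and a "can be answered from" table. The parameter names and the example data layout are assumptions for illustration; the costs are those of the slides' eight-view example.

```python
def greedy_select(cost, below, k, top):
    """Return the top view plus k more views, picking the largest benefit each round.

    cost[v]  : cost of answering a query from view v
    below[v] : set of views w with w ≼ v (v included)
    """
    def benefit(v, selected):
        total = 0
        for w in below[v]:
            cheapest = min(cost[u] for u in selected if w in below[u])
            total += max(cheapest - cost[v], 0)
        return total

    selected = [top]
    for _ in range(k):
        candidates = [v for v in cost if v not in selected]
        best = max(candidates, key=lambda v: benefit(v, selected))
        selected.append(best)
    return selected

# Example lattice from the slides (BELOW is an assumed reading of the figure).
COST = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30, "f": 40, "g": 1, "h": 10}
BELOW = {"a": set("abcdefgh"), "b": set("bdegh"), "c": set("cefgh"),
         "d": set("dg"), "e": set("egh"), "f": set("fh"),
         "g": set("g"), "h": set("h")}

print(greedy_select(COST, BELOW, k=2, top="a"))  # ['a', 'b', 'f']
```

Each round recomputes every benefit against the current selection, which is why a view that looked poor in the first round can win the second.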

Selecting the first view

After selecting coins, let us go back to our problem: selecting views.

We must materialize the top view, i.e. the view grouping by all attributes
  It cannot be constructed from other views
  It avoids going to the raw data

Selecting k more views

Space is limited! Suppose we can only select k more views.

For each view that is not yet selected, calculate the benefit of materializing it.

Pick the one with maximum benefit!!!

Let’s set k = 2 for the example.

Example

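[Figure: lattice of views labelled with their costs. Top level: a (100). Second level: b (50), c (75). Third level: d (20), e (30), f (40). Bottom level: g (1), h (10).]

E.g. the cost of constructing view b given the view a is 100.

If we choose b to materialize, the new cost of constructing view b is 50.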

First round

Notice that not only b, but also d, e, g and h can be calculated from b

So the total benefit is (100 – 50) x 5 = 250

Continue…

Similarly, the benefit of materializing c is (100 – 75) x 5 = 125

Benefit
b 250
c 125

Not yet finished…

For e, benefit = (100 – 30) x 3 = 210

Benefit
b 250
c 125
e 210

Let’s choose b!

For d and f, benefit = (100 – 20) x 2 = 160 and (100 – 40) x 2 = 120 respectively.

Benefit
b 250
c 125
d 160
e 210
f 120

Next round?

It seems we should choose e next, as it has the second largest benefit.

Let’s see what happens in the second round.

Benefit
b 250
c 125
d 160
e 210
f 120

Second round!

Now, only c and f benefit if we materialize c (since e, g and h can be calculated more efficiently by using b).

Benefit = (100 – 75) x 2 = 50

Benefit
c 50

How about choosing f?

If we choose f, we find that h can be calculated more cheaply by using f instead of b.

Benefit = (100 – 40) + (50 – 40) = 70

Benefit
c 50
f 70

Easy to work out others

Benefit of d = (50 – 20) x 2 = 60
Benefit of e = (50 – 30) x 3 = 60
Benefit of g = 50 – 1 = 49
Benefit of h = 50 – 10 = 40

Observation

In the first round, the benefit of choosing f (only 120) is far from the best choice (250).

But in the second round, choosing f gives the maximum benefit!

1st round   Benefit
b           250
c           125
d           160
e           210
f           120

2nd round   Benefit
c           50
d           60
e           60
f           70
g           49
h           40

Simple? Optimal?

Trade off again! This simple algorithm is not optimal in all cases!

Consider the following case…

Bad example

[Figure: top view a (cost 200) above views b (100), c (99) and d (100); below them, groups of 20 nodes with total cost 1000 each.]

Bad example

Choose c

Benefit = (200 – 99) x (1 + 20 + 20) = 4141 = maximum

Bad example

Now choose either b or d (same benefit)

Bad example

How about these? Very expensive!!!

Optimal solution should be…

[Figure: the same lattice, with b and d chosen instead of c.]

Only c is a little bit expensive.

Some theoretical result

It can be proved that the greedy algorithm achieves at least (e – 1)/e (about 63%) of the benefit of the optimal selection.
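As a quick numeric check of the 63% figure: the cited paper, as far as we recall, states the guarantee for k selected views as 1 – ((k – 1)/k)^k, which decreases towards (e – 1)/e as k grows. The sketch below only evaluates that expression; the function name is hypothetical and the per-k formula is an assumption not shown on the slide.

```python
import math

def greedy_ratio_bound(k):
    """Assumed lower bound on greedy benefit / optimal benefit when k views are picked."""
    return 1 - ((k - 1) / k) ** k

limit = (math.e - 1) / math.e  # the value quoted on the slide, about 0.632

for k in (1, 2, 5, 10, 100):
    print(k, round(greedy_ratio_bound(k), 4))
print("limit", round(limit, 4))
```

For k = 2 the expression gives 3/4, and every finite k stays above the limit.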

Extensions (1)

Problem
  The views in a lattice are unlikely to have the same probability of being requested in a query.
Solution
  We can weight each benefit by its probability.

Extensions (2)

Problem
  Instead of asking for some fixed number (k) of views to materialize, we might instead allocate a fixed amount of space to views.
Solution
  We can consider the “benefit of each view per unit space”.
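Both extensions are small changes to the benefit computation. The sketch below uses the part/supplier/customer lattice and its row counts from the earlier slide, and assumes that the cost of answering a query from a view equals that view's row count; the probabilities and function names are made up for illustration.

```python
# Row counts from the p/s/c lattice slide; "" is the "none" view.
SIZE = {"psc": 6_000_000, "pc": 6_000_000, "ps": 800_000, "sc": 6_000_000,
        "p": 200_000, "s": 10_000, "c": 100_000, "": 1}

def below(v):
    """All views w with w ≼ v (attribute subset test)."""
    return [w for w in SIZE if set(w) <= set(v)]

def weighted_benefit(v, selected, prob):
    """Extension (1): weight each saving by the probability that w is queried."""
    total = 0.0
    for w in below(v):
        cheapest = min(SIZE[u] for u in selected if set(w) <= set(u))
        total += prob[w] * max(cheapest - SIZE[v], 0)
    return total

def benefit_per_unit_space(v, selected, prob):
    """Extension (2): rank candidates by benefit divided by the space v occupies."""
    return weighted_benefit(v, selected, prob) / SIZE[v]

# Hypothetical query distribution: every view equally likely.
UNIFORM = {v: 1 / len(SIZE) for v in SIZE}

print(weighted_benefit("ps", ["psc"], UNIFORM))
print(benefit_per_unit_space("ps", ["psc"], UNIFORM))
```

A greedy loop under a space budget would pick the candidate with the largest per-unit-space value that still fits, instead of counting k views.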

Conclusions

Materialization of views is an essential query optimization strategy for decision-support applications.

There are good reasons to materialize only part of the data cube rather than all of it.

A lattice framework models multidimensional analysis very well.

Conclusions (cont.)

Finding the optimal solution in the case of an irregular lattice is NP-hard.

Introduction of the greedy algorithm
  The greedy algorithm works on this lattice and picks almost the right views to materialize.

Conclusions (the end)

There exist cases in which the greedy algorithm fails to produce the optimal solution.

But the greedy algorithm has guaranteed performance.

Extensions of the greedy algorithm.

Reference

Venky Harinarayan, Anand Rajaraman, Jeffrey D. Ullman. Implementing Data Cubes Efficiently. SIGMOD’96:205-216.

Thank you~

Q & A Section