MDL Summarization with Holes Shaofeng Bu Laks V.S. Lakshmanan Raymond T. Ng University of British Columbia, Canada

MDL Summarization with Holes

Shaofeng Bu Laks V.S. Lakshmanan

Raymond T. Ng

University of British Columbia, Canada

VLDB 05 Shaofeng Bu UBC 2

Introduction

Multi-dimensional OLAP queries typically produce data-intensive answers.
Often the question is how to express the large answer set of cells that satisfy the OLAP query conditions:
    Simple enumeration: accurate but not necessarily the most intuitive;
    Summaries: not (necessarily) 100% accurate but can be more intuitive and informative.
Summarized answers can be more easily understood.

OLAP Data Cube Example

[Figure: a 2-D data cube. Location axis: New York, Vancouver, Edmonton, San Jose, San Francisco, Chicago, Minneapolis, Boston, Summit, Albany, grouped under northwest, midwest and northeast. Clothes axis: jackets, tops, women’s jeans, blouses, skirts, formal wear, men’s jeans, dress pants, ties, dress skirts, grouped under women’s and men’s.]

Each dimension is associated with a hierarchical tree.

OLAP Data Cube Example

[Figure: the 2-D data cube with a location axis (cities grouped under northwest, midwest and northeast) and a clothes axis (items grouped under women’s and men’s).]

Data Cell: (c1, c2), where c1 and c2 are leaf nodes in the axis-trees, e.g. (Vancouver, ties).
Data Region: (x1, y1), which describes all data cells covered by the given nodes x1 and y1 in the axis-trees, e.g.:
    (Vancouver, ties), (Vancouver, women’s), (northwest, women’s)
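A minimal sketch of how a data region expands into the data cells it covers. The tree fragments below are hypothetical: the slide does not say which leaf sits under which parent, so the assignments (and the names `LOCATION`, `CLOTHES`, `leaves`, `region_cells`) are assumptions for illustration only.

```python
from itertools import product

# Hypothetical fragments of the two axis-trees (parent -> children).
# These parent/child assignments are assumed, not taken from the slide.
LOCATION = {"northwest": ["Vancouver", "Edmonton"]}
CLOTHES = {"women's": ["blouses", "skirts"]}


def leaves(tree, node):
    """Return the leaf nodes under `node`; a leaf is its own only leaf."""
    children = tree.get(node)
    if not children:
        return [node]
    result = []
    for child in children:
        result.extend(leaves(tree, child))
    return result


def region_cells(x, y):
    """Data cells covered by the data region (x, y)."""
    return set(product(leaves(LOCATION, x), leaves(CLOTHES, y)))


cells = region_cells("northwest", "women's")   # a region of 2 x 2 cells
single = region_cells("Vancouver", "skirts")   # a region that is one cell
```

A data cell is simply the special case where both nodes are leaves, which is why descriptions can mix regions and cells freely.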

OLAP Data Cube Example

[Figure: the same data cube, with the cells that satisfy the query highlighted in blue.]

Blue cells: the cells that satisfy the query conditions;

How to find a summary of the blue cells in a data cube?

MDL Summarization

MDL: Minimum Description Length
Use regions to cover the blue cells;
Length of an MDL description is the number of included regions and cells;
MDL is to find the description with the minimum length.

An Example of MDL Summarization

[Figure: the data cube with the blue cells covered by regions R1 to R9.]

A Motivating Example: A New Case

[Figure: the same data cube after some cells are "not blue cells any more". Regions R1, R3 and R9 are marked "?"; the figure also shows regions R2, R4 to R8 and R10 to R13.]

MDL Summarization:
    10 regions
    8 single blue cells
    Total length = 18

Can we do better?

Yes! We present a new compression approach: MDL with Holes:

Identify regions with blue cells, even if they contain non-blue cells;

Express the included blue cells by using regions with the exception of the covered non-blue cells;

Non-blue cells are called holes.

A Motivating Example: MDL with Holes

[Figure: the same data cube, with the regions marked "?" now described with holes.]

R2, R4, R5, R6, R7, R8 (annotated "Plus other 6 regions")
?R1: R1 - (Vancouver, skirts)
?R3: R3 - (Vancouver, skirts)
Combined: R1 + R3 - (Vancouver, skirts)
?R9: R9 - (Boston, ties) - (New York, dress skirts)

MDL with Holes: Length = 6 + 3 + 3 = 12
MDL Approach: Length is 18

Problem Statements

MDL with Holes (MDLH) is the problem of finding a description with holes that has the minimum length and, equivalently, the maximum benefit.

In practice, we can drill down on regions to get additional details.

Definitions: Length & Benefit

Given a set B of data cells (blue cells), an MDLH description for B is D = S – H, where:
    S is a set of data regions;
    H is a set of data cells, also called ‘holes’;
    D covers exactly the data cells in B.

Length: total number of the included regions and cells in the description.
    |D| = |S| + |H|
Benefit: how much shorter the MDLH summary is than the enumeration of B.
    Benefit(D) = |B| – |D|

[Figure: a tree with root x, internal nodes s and t, and leaves a to h.]

    B1 = {a, b, c}    D1 = s – d        |D1| = 2    Benefit(D1) = |B1| – |D1| = 1
    B2 = {e, g}       D2 = t – f – h    |D2| = 3    Benefit(D2) = |B2| – |D2| = –1
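The two definitions can be checked against the slide's examples with a few lines of code. This is only an illustrative sketch; the function names are not from the talk.

```python
def length(regions, holes):
    """|D| = |S| + |H| for a description D = S - H."""
    return len(regions) + len(holes)


def benefit(blue_cells, regions, holes):
    """Benefit(D) = |B| - |D|."""
    return len(blue_cells) - length(regions, holes)


# D1 = s - d describes B1 = {a, b, c}
b1 = benefit({"a", "b", "c"}, regions={"s"}, holes={"d"})
# D2 = t - f - h describes B2 = {e, g}
b2 = benefit({"e", "g"}, regions={"t"}, holes={"f", "h"})
```

A negative benefit, as for D2, means that plain enumeration of the blue cells is shorter than the description with holes.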

Related Work

The Generalized MDL Approach for Summarization, Laks V.S. Lakshmanan, Raymond T. Ng et al., VLDB 2002
    Reduce description length by allowing non-blue cells to be covered in the regions;
    The regions are not pure.

Concise Descriptions of Subsets of Structured Sets, Alberto O. Mendelzon & Ken Q. Pu, PODS 2003
    Allow Cartesian products to be formed;
    Not purely hierarchical: the NP-completeness result is less surprising;
    What about the purely hierarchical case?

Intelligent Rollups in Multidimensional OLAP Data, Gayatri Sathe and Sunita Sarawagi, VLDB 2001
    Only report consistent generalizations: a tuple can be generalized along a set of dimensions only if it can be generalized along all subsets of dimensions.

Outline

Introduction to MDL with Holes
    A motivating example
1-D Case: MDLH is Tractable
2-D Case: MDLH is NP-Hard
Heuristics
    A Greedy Heuristic
    Dynamic Programming
    Quadratic Programming
Experimental Results
Summarization on Holes: An Extension
Conclusions & Contributions

1-D Case: MDLH is Tractable

[Figure: a 1-D hierarchy with leaves a to r, internal nodes s, t, u, v, w above them, then x and y, and root z.]

‘x’
    D1 = x – d – f – j                    Benefit(D1) = 7 – 4 = 3
    D2 = (s – d) + e + (u – j)            Benefit(D2) = 7 – 5 = 2
‘y’
    D3 = y – m – p – q – r                Benefit(D3) = 4 – 5 = –1
    D4 = (v – m) + o                      Benefit(D4) = 4 – 3 = 1
‘z’
    D5 = z – d – f – j – m – p – q – r    Benefit(D5) = 11 – 8 = 3
    D6 = (x – d – f – j) + (v – m) + o    Benefit(D6) = 11 – 7 = 4

MDLH is tractable: in the 1-D case, the optimal MDLH description, which has the maximum benefit, can be generated in polynomial time.
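A minimal sketch of a bottom-up computation for the 1-D case. It is not claimed to be the authors' algorithm. The tree shape is an assumption inferred from the descriptions D1 to D6 on the slide. At each node the sketch compares two options: describing the children separately, or using the node as one region minus all of its non-blue leaves.

```python
# Tree inferred from the descriptions on the slide (an assumption):
# s..w group the leaves, x and y group those, z is the root.
TREE = {
    "z": ["x", "y"],
    "x": ["s", "t", "u"],
    "y": ["v", "w"],
    "s": list("abcd"),
    "t": list("ef"),
    "u": list("ghij"),
    "v": list("klmn"),
    "w": list("opqr"),
}
HOLES = set("dfjmpqr")  # the non-blue leaves named in D1..D6


def solve(node):
    """Return (best_length, non_blue_leaf_count) for the subtree at `node`."""
    children = TREE.get(node)
    if children is None:  # leaf: a blue cell costs 1, a non-blue cell costs 0
        return (0, 1) if node in HOLES else (1, 0)
    results = [solve(child) for child in children]
    split = sum(r[0] for r in results)   # describe each child separately
    non_blue = sum(r[1] for r in results)
    whole = 1 + non_blue                 # one region minus all of its holes
    return min(split, whole), non_blue


best_length, _ = solve("z")
blue_count = 18 - len(HOLES)             # 18 leaves, a to r
best_benefit = blue_count - best_length
```

Under these assumptions the sketch reproduces the slide's numbers: length 4 for ‘x’, length 3 for ‘y’, and length 7 (benefit 4) for ‘z’, matching D1, D4 and D6. Each node is visited once, so the running time is linear in the size of the tree.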

2-D Case: Optimality is not Preserved Any More

[Figure: a 2-D grid with columns 1 to 7 under parent 8 and rows a to g under parent i.]

rows                               length  benefit
(c,8), (d,8), (e,8)                4       0
(f,8), (g,8)                       3       2
(a,8), (b,8)                       5       -2

columns                            length  benefit
(i,1)                              3       2
(i,5)                              5       -2
(i,2), (i,3), (i,4), (i,6), (i,7)  4       0

Optimal Solution:
{(c,8)+(d,8)+(e,8)+(i,2)+(i,3)+(i,4)} - {(c,2)+(c,3)+(c,4)+(d,2)+(d,3)+(d,4)+(e,2)+(e,3)+(e,4)} + (f,1)+(g,1)+(f,6)+(g,7)
Length = 19
Benefit = 28 - 19 = 9

MDLH is NP-Hard in 2-D Case

It is NP-Hard to find the optimal MDLH description in a 2-D data cube;
Not a trivial proof: details are in the paper;
Reduction strategy:
    Clique
    → Maximum Induced Subgraph in Complete Edge-Weighted (CEW) Bipartite Graph
    → MDL with Holes

Heuristics for MDLH

Greedy
    Each time, choose the row/column with the most benefit.
Dynamic Programming
    A bottom-up method to get the description of a region from the descriptions of its children regions.
Quadratic Programming
    Using a quadratic function to represent the benefit of a 2-d data cube.

Example for Comparison with Heuristics

[Figure: a 2-D grid with columns 1 to 9 grouped under 10 and 11, which are grouped under 12, and rows a to d under parent e.]

The optimal description for this example:
(e,1)-(a,1)+(e,2)-(b,2)+(e,3)-(b,3)+(d,4)+(b,5)+(e,6)+(e,8)+(a,11)-(a,8)

Length = 12
Benefit = 8

Heuristics: A Greedy Heuristic

[Figure: the example grid with the regions chosen by the greedy heuristic.]

region   length  benefit  holes
(e,6)    1       3        -
(d,10)   2       2        (d,5)
(e,1)    2       1        (a,1)
(e,2)    2       1        (b,2)
(e,3)    2       1        (b,3)
(a,11)   2       1        (a,8)
(e,8)    2       1        (a,8)
(c,10)   3       0        (c,4)(c,5)

Description by Greedy:
(e,6)+(a,11)+(e,8)-(a,8)+(d,10)-(d,5)+(a,2)+(a,3)+(b,1)+(b,5)+(c,1)+(c,2)+(c,3)

The length is 13
The benefit is 20-13 = 7

Greedy: Why is it not optimal?

[Figure: two copies of the example grid, one showing the description from Greedy and one showing the optimal description.]

Selecting a row/column may take away more benefit from other candidates than it contributes itself.

Heuristics: Dynamic Programming

[Figure: the example grid, with arrows labelled t1 and t2.]

L: The Length of a Region

     1   2   3   4   5   10  6   7   8   9   11  12
a                        2                   2   4
b                        2                   2   4
c                        3                   2   5
d                        2                   2   4
e    2   2   2   1   1   8   1   1   2   1   5   13

S: Selection of Rows & Columns

     1   2   3   4   5   10  6   7   8   9   11  12
a                        t2                  g   t2
b                        t2                  t2  t2
c                        t2                  t2  t2
d                        g                   t2  t2
e    g   g   g   t1  t1  t2  g   t1  g   t1  t2  t2

(a,10): (a,2) + (a,3)       L(a,10) = 2, S(a,10) = ‘t2’
(e,4): (d,4)                L(e,4) = 1, S(e,4) = ‘t1’
(d,10): (d,10) – (d,5)      L(d,10) = 2, S(d,10) = ‘g’

Heuristics: Dynamic Programming (2)

[Figure: the example grid and the selection table S from the previous slide.]

D(x1,x2): description for region (x1,x2)

S(e,12) = ‘t2’:  D(e,12) = D(e,10) + D(e,11)
S(e,10) = ‘t2’:  D(e,10) = D(e,1) + D(e,2) + D(e,3) + D(e,4) + D(e,5)
S(e,11) = ‘t2’:  D(e,11) = D(e,6) + D(e,7) + D(e,8) + D(e,9)

D(e,1) = (e,1)-(a,1)    D(e,2) = (e,2)-(b,2)    D(e,3) = (e,3)-(b,3)
D(e,4) = (d,4)          D(e,5) = (b,5)          D(e,6) = (e,6)
D(e,7) = (a,7)          D(e,8) = (e,8)-(a,8)    D(e,9) = (a,9)

Generated Description:
(e,1)-(a,1)+(e,2)-(b,2)+(e,3)-(b,3)+(d,4)+(b,5)+(e,6)+(a,7)+(e,8)-(a,8)+(a,9)
The length is 13 and the benefit is 20-13 = 7
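The L and S tables can be reproduced with a short memoised recursion. This is a sketch under assumptions, not the authors' code: the blue cells are reconstructed from the slides, and the labels are read as ‘g’ (take the region as a whole and subtract its holes), ‘t1’ (combine the descriptions of the row children) and ‘t2’ (combine the descriptions of the column children), which is an interpretation consistent with the three worked entries.

```python
from functools import lru_cache
from itertools import product

# Example grid reconstructed from the slides (an assumption).
ROWS = {"e": ["a", "b", "c", "d"]}
COLS = {"12": ["10", "11"],
        "10": ["1", "2", "3", "4", "5"],
        "11": ["6", "7", "8", "9"]}
BLUE = {(r, c)
        for r, cs in {"a": "23679", "b": "1568",
                      "c": "12368", "d": "123468"}.items()
        for c in cs}


def leaves(tree, node):
    """Leaf nodes under `node` (a leaf is its own only leaf)."""
    kids = tree.get(node)
    if kids is None:
        return [node]
    return [leaf for kid in kids for leaf in leaves(tree, kid)]


@lru_cache(maxsize=None)
def best(r, c):
    """Return (length, choice) for region (r, c)."""
    if r not in ROWS and c not in COLS:          # a single data cell
        return (1, "cell") if (r, c) in BLUE else (0, "empty")
    cells = set(product(leaves(ROWS, r), leaves(COLS, c)))
    options = []
    if r in ROWS:                                # t1: split along the rows
        options.append((sum(best(k, c)[0] for k in ROWS[r]), "t1"))
    if c in COLS:                                # t2: split along the columns
        options.append((sum(best(r, k)[0] for k in COLS[c]), "t2"))
    options.append((1 + len(cells - BLUE), "g"))  # g: region minus its holes
    return min(options, key=lambda option: option[0])


length_e12, choice_e12 = best("e", "12")
```

With this reading, `best("e", "12")` yields length 13 via ‘t2’, matching the generated description. Ties are broken in favour of splitting, which happens to agree with the S table for this example.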

Dynamic Programming: Why is it not optimal?

[Figure: two copies of the example grid, one showing the description by Dynamic Programming and one showing the optimal description.]

Dynamic programming misses the combination of rows and columns.

Heuristics: Quadratic Programming

Use variables to represent rows/columns; for a variable v:
    v = 1: the corresponding row/column is selected;
    v = 0: the corresponding row/column is not selected.
f = –Benefit(D): maximizing the benefit is to minimize the value of f.
For the previous example, quadratic programming generates the optimal description;
Optimality is not guaranteed.
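One way to see why the objective is quadratic, as a sketch under simplifying assumptions that are not taken from the slides: consider a single-level grid with binary variables $r_i$ for rows and $c_j$ for columns, and let $b_{ij} = 1$ when cell $(i,j)$ is blue and $0$ otherwise. Assume each selected row or column costs one region, each covered non-blue cell costs one hole, and uncovered blue cells are listed singly.

```latex
\begin{aligned}
\mathrm{cov}_{ij} &= r_i + c_j - r_i c_j, \qquad r_i, c_j \in \{0, 1\} \\
\mathrm{Benefit}(D) &= \sum_{i,j} (2 b_{ij} - 1)\,\mathrm{cov}_{ij} \;-\; \sum_i r_i \;-\; \sum_j c_j \\
f &= -\mathrm{Benefit}(D)
\end{aligned}
```

Here $\mathrm{cov}_{ij}$ is 1 exactly when cell $(i,j)$ is covered by a selected row or column. The product term $r_i c_j$ is what makes $f$ quadratic. The hierarchical cubes used in the talk would need further variables and terms.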

Experiments

We ran a set of experiments on the TPC-H benchmark data set;

We compared the three MDLH heuristics with MDL and GMDL.

Experimental Results: Comparison of All Methods

Compression Ratio:
    MDLH-Quadratic generates the most concise descriptions: a yardstick of quality;
    MDLH-Dynamic is a very close second.

[Figure: compression ratio (y-axis, 1 to 2.8) versus number of blue cells and blue density (x-axis, 3916 (25%) to 11307 (72%)) for MDL, MDLH-Greedy, MDLH-Dynamic, MDLH-Quadratic, GMDL-5% and GMDL-10%.]

Experimental Results: Compression Ratio

[Figure: compression ratio (y-axis, 1 to 4.5) versus number of blue cells and blue density (x-axis, 10000 (20%) to 40000 (80%)) for MDL, MDLH-Greedy, MDLH-Dynamic, GMDL-5% and GMDL-10%.]

The more children per parent node, the greater the benefit.

Experimental Results: Summary

Running time & Scalability:
    MDLH-Greedy is the fastest;
    MDLH-Dynamic runs slower than MDLH-Greedy, but it is still scalable w.r.t. the number of cells.

[Figure: run time in seconds (y-axis, 0 to 100) of MDL, GMDL, MDLH-Greedy and MDLH-Dynamic on a 3-d 3-level, a 3-d 4-level and a 5-d 4-level datacube; one bar is annotated "379 secs".]

Extension: Summarization on holes

As the blue density becomes high, a large part of the MDLH description is made up of holes.
Can we further reduce the total length by summarizing ‘Holes’?

[Figure: a 2-D grid with column labels 1 to 11 and row labels a to e.]

MDLH description:
    (a,11) - {(a,6)+(a,8)+(a,9)} + (d,11) - {(d,6)+(d,7)+(d,8)} + (b,6) + (c,8)
    Total length is 10.

Summarization on holes:
    (a,6)+(a,8)+(a,9) = (a,10) - (a,7)
    (d,6)+(d,7)+(d,8) = (d,10) - (d,9)

After summarization on holes:
    (a,11) - {(a,10) - (a,7)} + (d,11) - {(d,10) - (d,9)} + (b,6) + (c,8)
    Total length is 8.
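The two totals can be checked by counting every region and cell named in a nested description. The representation below (a list of names, or pairs of a region and its holes) is an assumption made for this sketch, not a structure from the talk.

```python
def length(description):
    """Count the regions and cells named in a nested description.

    A description is a list of terms. Each term is either a name (str)
    or a pair (region, holes), where holes is itself a description.
    """
    total = 0
    for term in description:
        if isinstance(term, str):
            total += 1
        else:
            _region, holes = term
            total += 1 + length(holes)
    return total


# MDLH description with holes enumerated one by one.
before = [("(a,11)", ["(a,6)", "(a,8)", "(a,9)"]),
          ("(d,11)", ["(d,6)", "(d,7)", "(d,8)"]),
          "(b,6)", "(c,8)"]

# The same description after the holes are themselves summarized.
after = [("(a,11)", [("(a,10)", ["(a,7)"])]),
         ("(d,11)", [("(d,10)", ["(d,9)"])]),
         "(b,6)", "(c,8)"]

lengths = (length(before), length(after))
```

Because holes are again a description, the same counting rule applies at every nesting level.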

Conclusions & Contributions

We present a new method, MDLH, to compress the answers of OLAP queries;
We present a bottom-up algorithm for the 1-d cube;
We proved the NP-Hardness of the MDLH problem;
We provided three heuristics for MDLH: greedy, dynamic programming, and quadratic programming;
We extended the approach with summarization on holes to further reduce the total length;
We ran a set of experiments on the TPC-H benchmark data to compare the heuristics.

Ongoing work

Based on the summarization on blue cells and the summarization on holes, build a visualization tool with MDLH summarization:
    Return summarized answers to users’ queries;
    Provide a drill-down operation for users:
        Browse details on blue cells
        Browse details on holes
Design a k-approximation algorithm for MDLH:
    What is the best quality we can guarantee?