DOLAP’03, November 7, 2003.

Attribute-Value Reordering for Efficient Hybrid OLAP

OWEN KASER

Dept. of Computer Science and Applied Statistics, University of New Brunswick, Saint John, NB Canada

DANIEL LEMIRE

National Research Council of Canada, Fredericton, NB Canada

Overview

✔ Coding dimensional values as integers

✔ Meet the problem (visually)

✔ Background (multidimensional storage)

✔ Packing data into dense chunks

✔ Experimental results

Background

Cube C is a partial function from dimensions to a measure value.

e.g.,

C : Item × Place × Time → Sales Amount

$C_{\text{Iced Tea, Auckland, January}} = 20000.0$.

$C_{\text{Car Wax, Toronto, February}}$ is undefined (C is a partial function).

Usefulness of Integer Indices in Cube C

Conceptually, $C_{\text{Iced Tea, Auckland, January}} = 20000.0$.

A common suggestion is to “replace strings by integers”.

For storage, the system (or database designer) is likely to code

for Months: January = 1, February = 2, . . .

for Items: Car Wax = 1, Cocoa Mix = 2, Iced Tea = 3, . . .

e.g., with row numbers in dimension tables (star schema).
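The coding step can be sketched in a few lines of Python. This is only an illustration: `build_codes`, the `facts` dictionary and the choice of alphabetical order are assumptions made here, not something the slides prescribe.

```python
def build_codes(values):
    """Assign codes 1..n to the distinct values, in alphabetical order (an assumed default)."""
    return {value: code for code, value in enumerate(sorted(set(values)), start=1)}

# Hypothetical facts keyed by attribute strings: (Item, Place, Month) -> Sales Amount.
facts = {("Iced Tea", "Auckland", "January"): 20000.0}

item_code = build_codes(["Iced Tea", "Car Wax", "Cocoa Mix", "Iced Tea"])
place_code = build_codes(["Toronto", "Auckland"])
month_code = {"January": 1, "February": 2}  # Month keeps its natural order

# The same cube, now indexed by integers only.
cube = {(item_code[i], place_code[p], month_code[m]): v
        for (i, p, m), v in facts.items()}
print(item_code)  # {'Car Wax': 1, 'Cocoa Mix': 2, 'Iced Tea': 3}
print(cube)       # {(3, 1, 1): 20000.0}
```

Any other dictionary with the same keys and the values 1..n in a different order would serve equally well for Item, which is the freedom the rest of the talk exploits.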

Freedom in Choosing Codes

For Item, these codes are arbitrary. Any other assignment of {1, . . . , n} to Items is a permutation of the initial one.

But for Month, there is a natural ordering.

And for Place, there may be a hierarchy (City, State, Country).

Code assignments for Month and Place should be restricted.

But to study the full impact of reordering, we do not restrict them.

Topic (visually)

To display a 2-d cube C, plot a pixel at (x, y) when $C_{x,y} \neq 0$.

• Rearranging (permuting) rows and columns can cluster or uncluster data.

• Left: nicely clustered; middle: columns permuted; right: rows permuted too.

Normalization

Let C be a d-dimensional cube of size $n_1 \times n_2 \times \dots \times n_d$.

A “normalization” is $\pi = (\gamma_1, \gamma_2, \dots, \gamma_d)$, with each $\gamma_i$ a permutation for dimension i, i.e., $\gamma_i$ is a permutation of $1, 2, \dots, n_i$.

Define the “normalized cube” $\pi(C)$ by

$\pi(C)[i_1, i_2, \dots, i_d] = C[\gamma_1(i_1), \gamma_2(i_2), \dots, \gamma_d(i_d)]$.

Note: $\gamma_i$ is “came from”; thus $\gamma_i^{-1}$ is “went to”.

To retrieve $C[i_1, \dots, i_d]$, use $\pi(C)[\gamma_1^{-1}(i_1), \dots, \gamma_d^{-1}(i_d)]$.
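A small 2-d sketch of these definitions in Python, assuming 0-based indices (the slides count from 1) and a cube stored as a list of rows. The names `normalize` and `inverse` are chosen here for illustration.

```python
def normalize(cube, gammas):
    """Return pi(C) for a 2-d cube: pi(C)[i][j] = C[gamma1[i]][gamma2[j]]."""
    g1, g2 = gammas
    return [[cube[g1[i]][g2[j]] for j in range(len(g2))] for i in range(len(g1))]

def inverse(gamma):
    """Inverse permutation: gamma[new] is "came from", inverse(gamma)[old] is "went to"."""
    inv = [0] * len(gamma)
    for new_pos, old_pos in enumerate(gamma):
        inv[old_pos] = new_pos
    return inv

C = [[1, 0, 2],
     [0, 3, 0]]
pi = ([1, 0], [2, 0, 1])           # one permutation per dimension
N = normalize(C, pi)               # [[0, 0, 3], [2, 1, 0]]

# Retrieval of an original cell through the inverse permutations.
inv1, inv2 = inverse(pi[0]), inverse(pi[1])
recovered = [[N[inv1[i]][inv2[j]] for j in range(3)] for i in range(2)]
print(N, recovered == C)
```

The round trip through `inverse` recovers every original cell, so queries phrased in the original codes remain answerable after normalization.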

Sparse vs Dense Storage

#C: the number of nonzero elements of C.

Density $\rho = \frac{\#C}{n_1 \times n_2 \times \dots \times n_d}$.

If $\rho \ll 1$, the cube is sparse. Otherwise, it is dense.

Sparse coding:

• goal: storage space depends on #C, not on $n_1 \times \dots \times n_d$.

• many approaches have been developed (decades-old work).

A Storage-Cost Model

Idea for the sparse case: to record that $A[x_1, x_2, \dots, x_d] = v$, we record a (d+1)-tuple $(x_1, x_2, \dots, x_d, v)$. The $x_i$ are typically small.

Our model: storing a d-dimensional cube C of size $n_1 \times n_2 \times \dots \times n_d$ costs

1. $n_1 \times n_2 \times \dots \times n_d$, if done densely,

2. $(d/2+1) \cdot \#C$, if done sparsely.
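The two costs, together with the density defined earlier, can be compared directly. The sketch below assumes the cube is summarized by its shape and its nonzero count; the function names and the example numbers are illustrative only.

```python
from math import prod

def density(nonzeros, shape):
    """rho = #C / (n1 * n2 * ... * nd)."""
    return nonzeros / prod(shape)

def dense_cost(shape):
    """Dense storage: one unit per cell, whether or not it is zero."""
    return prod(shape)

def sparse_cost(nonzeros, d):
    """Sparse storage: (d/2 + 1) units per nonzero cell."""
    return (d / 2 + 1) * nonzeros

shape, nnz = (10, 10, 10), 200     # hypothetical 3-d cube
d = len(shape)
print(density(nnz, shape))         # 0.2
print(dense_cost(shape))           # 1000
print(sparse_cost(nnz, d))         # 500.0
break_even = 1 / (d / 2 + 1)       # sparse is cheaper below this density (0.4 for d = 3)
```

Setting the two costs equal gives the break-even density 1/(d/2+1), so under this model a cube or block is cheaper to store sparsely exactly when its density is below that value.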

Chunked/Blocked Storage (Sarawagi’94)

Partition the d-dim cube into d-dim subcubes (blocks).

For simplicity, assume block size $m_1 \times m_2 \times \dots \times m_d$.

Choose “store sparsely” or “store densely” on a chunk-by-chunk basis.

Normalization Affects Storage Costs

Worst case: all blocks sparse, with $0 < \rho < \frac{1}{d/2+1}$.

Best case: each block has ρ = 1 or ρ = 0.

[Figure: example cubes joined by arrows]

Lemma 1: there are cubes where normalization can turn worst cases into best cases. The example above isn’t quite one!
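A tiny instance of the kind Lemma 1 describes can be checked by computing the chunked storage cost directly. The sketch assumes a 2-d cube stored as a list of rows, dimensions divisible by the block size, and the per-block rule "take the cheaper of dense and sparse"; `hybrid_cost` and the 4×4 example are constructed here and are not the figure from the slides.

```python
from itertools import product

def hybrid_cost(cube, block):
    """Chunked storage cost of a 2-d cube: each block is stored the cheaper way."""
    d = 2
    m1, m2 = block
    total = 0.0
    for r0, c0 in product(range(0, len(cube), m1), range(0, len(cube[0]), m2)):
        nnz = sum(1 for r in range(r0, r0 + m1)
                    for c in range(c0, c0 + m2) if cube[r][c] != 0)
        total += min(m1 * m2, (d / 2 + 1) * nnz)   # dense cost vs sparse cost
    return total

# Worst case: every 2x2 block has density 1/4, below 1/(d/2+1) = 1/2.
before = [[1, 0, 1, 0],
          [0, 0, 0, 0],
          [1, 0, 1, 0],
          [0, 0, 0, 0]]
perm = [0, 2, 1, 3]                # same reordering applied to rows and columns
after = [[before[r][c] for c in perm] for r in perm]

print(hybrid_cost(before, (2, 2)))  # 8.0
print(hybrid_cost(after, (2, 2)))   # 4.0
```

After the reordering one block is completely full and the other three are empty, which is the best case, and the cost drops from 8 to 4.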

Optimal Normalization

Optimal Normalization Problem

Given: a d-dimensional cube C and chunk sizes in each dimension $(m_1, m_2, \dots, m_d)$.

Output: a normalization $\pi$ that minimizes the storage cost $H(\pi(C))$.

“Code assignment affects chunked storage efficiency”, observed by Deshpande et al., SIGMOD’98. Sensible heuristic: let the dimension’s hierarchy guide you.

Issue apparently never addressed in depth after this (?)

Complexity

Consider the “decision problem” version that adds a storage bound K. It asks: “Is there a normalization $\pi$ with $H(\pi(C)) \le K$?”

Theorem 1. The decision problem for Optimal Normalization is NP-complete, even for d = 2, $m_1 = 1$ and $m_2 = 3$.

Proved by reduction from Exact-3-Cover.
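For very small cubes the optimum can still be found by trying every ordering, which gives a feel for the restricted case in Theorem 1. The sketch assumes d = 2 with 1×3 blocks, so only the column order matters, and a column count divisible by 3; `cost_1x3`, `brute_force` and the example cube are illustrative and have nothing to do with the reduction used in the proof.

```python
from itertools import permutations

def cost_1x3(cube, col_order):
    """Cost H for d = 2 with 1x3 blocks after reordering the columns by col_order."""
    total = 0.0
    for row in cube:
        for c0 in range(0, len(col_order), 3):
            nnz = sum(1 for c in col_order[c0:c0 + 3] if row[c] != 0)
            total += min(3, 2 * nnz)   # dense cost 3 vs sparse cost (2/2 + 1) * nnz
    return total

def brute_force(cube):
    """Try every column order: n! candidates, so only usable for tiny cubes."""
    n_cols = len(cube[0])
    return min(permutations(range(n_cols)), key=lambda order: cost_1x3(cube, order))

C = [[1, 0, 0, 1, 0, 1],
     [0, 1, 1, 0, 1, 0]]
print(cost_1x3(C, tuple(range(6))))    # 10.0 with the original column order
best = brute_force(C)
print(cost_1x3(C, best))               # 6.0
```

Six columns already mean 720 candidate orders, and the count grows factorially, which is why the heuristics discussed later matter.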

Volume-2 Blocks

There is an efficient algorithm when $\prod_{i=1}^{d} m_i = 2$.

Theorem 2. For blocks of size $\overbrace{1 \times \dots \times 1}^{k-1} \times 2 \times 1 \times \dots \times 1$, the best normalization can be computed in $O(n_k \times (n_1 \times n_2 \times \dots \times n_d) + n_k^3)$ time.

The algorithm relies on a cubic-time weighted-matching algorithm.

It can probably be improved so that the time depends on #C, not on $\prod_{i=1}^{d} n_i$.
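One way to see why matching fits this case, offered as a reading of the cost model rather than as the method from the paper: when d ≥ 2, any nonempty block of volume 2 costs 2, so the total cost is twice the number of nonempty blocks, and pairing two slices that share many nonzero positions reduces that number. The sketch below therefore pairs slices to maximize total overlap. It assumes an even number of slices and uses exhaustive search in place of a cubic-time weighted-matching algorithm; `overlap` and `best_pairing` are hypothetical names.

```python
def overlap(a, b):
    """Number of positions where both slices are nonzero."""
    return sum(1 for x, y in zip(a, b) if x != 0 and y != 0)

def best_pairing(slices):
    """Exhaustive maximum-weight perfect matching on slice indices (tiny inputs only)."""
    def solve(remaining):
        if not remaining:
            return 0, []
        first, rest = remaining[0], remaining[1:]
        best = (-1, [])
        for partner in rest:
            others = [s for s in rest if s != partner]
            weight, pairs = solve(others)
            weight += overlap(slices[first], slices[partner])
            if weight > best[0]:
                best = (weight, [(first, partner)] + pairs)
        return best
    return solve(list(range(len(slices))))

# Rows of a hypothetical 2-d cube; blocks are 2x1, i.e. pairs of adjacent rows.
rows = [[1, 1, 0, 0],
        [0, 0, 1, 1],
        [1, 1, 0, 1],
        [0, 0, 1, 0]]
weight, pairs = best_pairing(rows)
order = [index for pair in pairs for index in pair]
print(weight, pairs, order)   # 3 [(0, 2), (1, 3)] [0, 2, 1, 3]
```

Placing the paired rows next to each other gives the row order for that dimension. The order within a pair and the order of the pairs do not change the cost, which is consistent with several orderings being optimal at once.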

Volume-2 Algorithm

Here, optimal orderings for the vertical dimension include A,B,C,D and C,D,B,A.

Heuristics

Tested many heuristics. Two are more noteworthy:

• Iterated Matching (IM). Applies the volume-2 algorithm to each dimension in turn, getting blocks of size $2 \times 2 \times \dots \times 2$. Not optimal.

• Frequency Sort (FS). $\gamma_i$ orders the values of dimension i by descending frequency.
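Frequency Sort is simple enough to sketch in full. The version below assumes the cube is given as a list of coordinate tuples for its nonzero cells (0-based), takes "frequency" to mean the number of nonzero cells carrying a value, and breaks ties by the original code; these choices and the name `frequency_sort` are assumptions made for illustration.

```python
from collections import Counter

def frequency_sort(cells, shape):
    """For each dimension, list the attribute values by descending frequency."""
    gammas = []
    for dim, size in enumerate(shape):
        counts = Counter(cell[dim] for cell in cells)   # missing values count as 0
        gammas.append(sorted(range(size), key=lambda v: (-counts[v], v)))
    return gammas

# Nonzero cells of a hypothetical 4 x 3 cube.
cells = [(0, 2), (1, 2), (3, 2), (3, 0), (3, 1)]
gammas = frequency_sort(cells, (4, 3))
print(gammas)   # [[3, 0, 1, 2], [2, 0, 1]]
```

Each returned list plays the role of $\gamma_i$: position j holds the original code that ends up with new code j, so the most frequent values are packed into the low-numbered corner of the cube.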

Frequency Sort (Results)

Independence and Frequency Sort

Frequency Sort (FS) is quickly computed. In our tests, it worked well.

This was traced to “much independence between dimensions”.

Result: we can quantify the dependence between the dimensions, getting a factor δ, where 0 ≤ δ ≤ 1.

Small δ ⇒ the FS solution is nearly optimal.

Calculating δ is easy. (In the paper, we used “IS”, where IS = 1 − δ.)

Relating δ to Frequency Sort Quality

FS is actually an approximation algorithm.

Theorem 3. FS has an absolute error bound of δ(d/2+1)#C.

Corollary. FS has a relative error bound of δ(d/2+1).

E.g., for a 4-d cube with δ = 0.1, the FS solution is at most 0.1 × (4/2+1) = 30% worse than optimal.
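One plausible route from Theorem 3 to the corollary, sketched here rather than taken from the slides: under the storage-cost model every nonzero cell costs at least one unit, whether its block is stored densely or sparsely, so the optimal cost satisfies $H_{\mathrm{opt}} \ge \#C$. Writing $H_{\mathrm{FS}}$ for the cost of the FS solution (both symbols are introduced here), dividing the absolute bound by $H_{\mathrm{opt}}$ gives the relative bound.

```latex
\frac{H_{\mathrm{FS}} - H_{\mathrm{opt}}}{H_{\mathrm{opt}}}
  \;\le\; \frac{\delta\,(d/2+1)\,\#C}{H_{\mathrm{opt}}}
  \;\le\; \frac{\delta\,(d/2+1)\,\#C}{\#C}
  \;=\; \delta\,(d/2+1),
\qquad
d = 4,\ \delta = 0.1:\quad 0.1 \times (4/2+1) = 0.3 .
```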

Experimental Results

Synthetic data does not seem appropriate for this work.

We obtained some large data sets from UCI’s KDD repository and elsewhere:

• Weather: 18-d, 1.1M facts, $\rho = 1.5 \times 10^{-30}$

• Forest: 11-d, 600k facts, $\rho = 2.4 \times 10^{-16}$

• Census: projected down to 18-d, 700k facts, also very sparse.

These seem too sparse by themselves.

Test Data

To get test data, we randomly chose 50 cubes from each of

• Weather datacube (5-d subsets)

• Forest datacube (3-d subsets)

• Census datacube (6-d subsets)

Most had 0.0001 ≤ ρ ≤ 0.2. We also required that each cube, if stored densely, fit in 100MB.

Experimental Results

Compression relative to sparse storage (ROLAP), using HOLAP chunked storage:

data set    default normalization    good normalization
Census      31%                      44% (using FS or IM)
Forest      31%                      40% (using IM)
Weather     19%                      29% (using FS or IM)

FS did poorly on many Forest cubes.

Is an additional 10% compression helpful? Hopefully, yes.

Is it disastrous to ignore? Hopefully, no.

δ versus FS quality

FrequencySort’s solutions theoretically improve when δ ↓. Do we see this experimentally?

Yes. Problem: we don’t know the optimal solution. Substitute: compare against IM!


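[Figure: ratio FS/IM (vertical axis, 0.85 to 1.3) plotted against δ (horizontal axis, 0.1 to 1.0) for the Forest, Census, and Weather data sets, with a reference line at FS = IM and the values 0.41 and 0.28 marked on the plot.]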


Conclusions/Summary

✔ Good normalization leads to useful space savings.

✔ Going for optimal normalization is too ambitious.

✔ FS is provably good when δ is low; experiments suggest the bound is pessimistic.

✔ Should help in a chunk-based OLAP engine being developed.


Questions??


Extra Slides


IS Preliminaries

Underlying probabilistic model: nonzero cube cells are uniformly likely to be chosen.

For each dimension $j$, get a probability distribution $\varphi^{j}$:

\[ \varphi^{j}_{v} = \frac{\text{number of nonzero cells with index } v \text{ in dimension } j}{\#C} \]

If all $\{\varphi^{j} \mid j \in \{1, \ldots, d\}\}$ are jointly independent:

\[ \Pr\bigl[C[i_1, i_2, \ldots, i_d] \neq 0\bigr] = \prod_{j=1}^{d} \varphi^{j}_{i_j} \]

and (claim) clearly FS gives an optimal algorithm.


IS

\[ IS = \sum_{C[i_1, i_2, \ldots, i_d] \neq 0} \left( \prod_{j=1}^{d} \varphi^{j}_{i_j} \right) \]

$(1 - IS)\,\#C$ is the expected number of nonzero cells that, if we assume independence, we would mispredict as zero. At worst, such cells will have to be stored sparsely, at cost $(d/2 + 1)$ each.
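The IS definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cube is given as a list of index tuples of its nonzero cells, the number of such cells plays the role of #C, and the function name `independence_sum` is hypothetical.

```python
from collections import Counter
from math import prod


def independence_sum(cells):
    """Return IS for a cube given as index tuples of its nonzero cells.

    Assumption: len(cells) plays the role of #C on the slide.
    """
    cells = list(set(cells))  # each nonzero cell counted once
    n = len(cells)
    d = len(cells[0])
    # counts[j][v]: number of nonzero cells whose index in dimension j is v
    counts = [Counter(cell[j] for cell in cells) for j in range(d)]
    # Sum, over the nonzero cells, of the product of the per-dimension
    # probabilities phi^j_{i_j} = counts[j][i_j] / n.
    return sum(
        prod(counts[j][cell[j]] / n for j in range(d)) for cell in cells
    )


full_cube = [(0, 0), (0, 1), (1, 0), (1, 1)]  # every cell allocated
diagonal = [(0, 0), (1, 1)]                   # only the diagonal allocated
print(independence_sum(full_cube), independence_sum(diagonal))
```

For the diagonal cube, IS is 0.5, so (1 − IS)#C = 1 and the worst-case sparse-storage penalty is (d/2 + 1) · 1 = 2.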


Relating IS to Frequency Sort Quality

Theorem 4. Given cube $C$, let $\varpi$ be an optimal normalization and $fs$ be a Frequency Sort normalization; then

\[ H(fs(C)) - H(\varpi(C)) \le \left( \frac{d}{2} + 1 \right) (1 - IS)\,\#C \]

where $H(\cdot)$ gives the storage cost of a cube.

The bound does not even consider block dimensions; further improvements?
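To get a feel for the size of this bound, here is a worked instance with purely illustrative numbers (assumed for the arithmetic only, not taken from the experiments): $d = 3$, $IS = 0.8$, and $\#C = 1000$.

```latex
\[
  H(fs(C)) - H(\varpi(C))
  \le \left( \frac{3}{2} + 1 \right)(1 - 0.8)\cdot 1000
  = 2.5 \times 0.2 \times 1000
  = 500
\]
```

Under these assumed values, Frequency Sort would be within 500 storage units of the optimal normalization.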


Exact 3-Cover (X3C)

[See Garey and Johnson, 1979]

Given: A set S and a set T of three-element subsets of S.

Question: Is there a T′ ⊆ T such that each s ∈ S occurs in exactly one member of T′?

X3C is known to be NP-complete.
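The decision problem itself is easy to state in code. The following is a brute-force sketch (exponential time, consistent with the problem being NP-complete); the function name `has_exact_3_cover` is hypothetical.

```python
from itertools import combinations


def has_exact_3_cover(S, T):
    """Brute-force X3C: is there a subfamily of T that partitions S?"""
    S = set(S)
    triples = [frozenset(t) for t in T]
    if len(S) % 3 != 0:
        return False
    k = len(S) // 3  # an exact cover must use exactly |S|/3 triples
    for chosen in combinations(triples, k):
        # k triples hold 3k = |S| element slots, so their union equals S
        # only when they are pairwise disjoint and cover every element.
        if set().union(*chosen) == S:
            return True
    return False


print(has_exact_3_cover({1, 2, 3, 4, 5, 6}, [{1, 2, 3}, {1, 2, 4}, {3, 5, 6}]))
```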


Transforming X3C to Optimal Normalization

Given an instance of X3C, make a |T| × |S| cube. For s ∈ S and each triple t ∈ T, the cube has an allocated cell corresponding to (t, s) ⇔ s ∈ t.

Cube has 3|T| cells to be stored.

Can be stored for ≤ 9|T| − |S| ⇔ the answer to the instance of X3C is “yes”.

Thus Optimal Normalization is NP-hard.


Example Transformation

X3C: S = {1,2,3,4,5,6}, T = {{1,2,3},{1,2,4},{3,5,6}}

Optimal Normalization: Blocks 3×1

              1 2 3 4 5 6
{1,2,3} −→    1 1 1 - - -
{1,2,4} −→    1 1 - 1 - -
{3,5,6} −→    - - 1 - 1 1

and set storage bound to 21.

Storage model: elements in full blocks cost 2 each, elements in non-full blocks cost 3 each.

Answer here is “yes”: swap columns 3 and 4. Then 6 elements are in full blocks and 3 are in non-full blocks (6×2 + 3×3 = 21).
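The example can be checked mechanically. The sketch below assumes that a block covers three consecutive columns within a single row (one reading of the slide's "3×1"), so only the column order matters; the function name `storage_cost` is hypothetical. It tries all 720 column orders and compares the best cost with the bound 9|T| − |S|.

```python
from itertools import permutations


def storage_cost(rows, column_order, width=3):
    """Cost under the slide's model: 2 per element in a full block, 3 otherwise.

    Assumption: a block is `width` consecutive columns of one row.
    """
    cost = 0
    for row in rows:
        for start in range(0, len(column_order), width):
            block = column_order[start:start + width]
            filled = sum(1 for col in block if col in row)
            cost += 2 * filled if filled == width else 3 * filled
    return cost


S = [1, 2, 3, 4, 5, 6]
T = [{1, 2, 3}, {1, 2, 4}, {3, 5, 6}]
bound = 9 * len(T) - len(S)  # storage bound from the reduction

# Exhaustive search over column orders (6! = 720 candidates).
best = min(storage_cost(T, order) for order in permutations(S))
print(storage_cost(T, S), storage_cost(T, [1, 2, 4, 3, 5, 6]), best, bound)
```

The original column order costs 24, while swapping columns 3 and 4 reaches the bound of 21, which is also the minimum over all column orders.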


IM is not optimal

[ 1 − 1 1 ]
[ 1 − − − ]

This is optimal for 1×2 and 2×1 blocks (storage cost 6) but has cost 8 for 2×2 blocks, whereas

[ 1 1 − 1 ]
[ 1 − − − ]

has cost 6 for 2×2 blocks.
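The quoted costs can be reproduced with a small sketch. It assumes a cost model that is consistent with these numbers: each block is stored either densely, at a cost equal to its volume, or sparsely, at d/2 + 1 = 2 per allocated cell for d = 2, whichever is cheaper. The function name `block_storage_cost` and the variable names are hypothetical.

```python
def block_storage_cost(cube, block_rows, block_cols, sparse_cost_per_cell=2):
    """Total cost of a 0/1 matrix cut into block_rows x block_cols blocks.

    Assumed model: per block, min(block volume, sparse cost * allocated cells).
    """
    n_rows, n_cols = len(cube), len(cube[0])
    total = 0
    for r0 in range(0, n_rows, block_rows):
        for c0 in range(0, n_cols, block_cols):
            filled = sum(
                cube[r][c]
                for r in range(r0, min(r0 + block_rows, n_rows))
                for c in range(c0, min(c0 + block_cols, n_cols))
            )
            dense_cost = block_rows * block_cols
            total += min(dense_cost, sparse_cost_per_cell * filled)
    return total


first = [[1, 0, 1, 1], [1, 0, 0, 0]]   # first matrix on the slide
second = [[1, 1, 0, 1], [1, 0, 0, 0]]  # second matrix on the slide

print(block_storage_cost(first, 1, 2), block_storage_cost(first, 2, 1))
print(block_storage_cost(first, 2, 2), block_storage_cost(second, 2, 2))
```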


Test Data

In OLAP, various aggregated views might be materialized.

A group-by over some subset of the dimensions is one cube in the overall datacube [Gray et al. ’96].

To get test cases, randomly choose cubes from the datacube (i.e., randomly select some subset of dimensions to get a test case).
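The sampling procedure can be sketched as follows. This is an illustration only: it assumes records are tuples whose last field is the measure, that the group-by aggregates with a sum, and that the subset size is drawn uniformly; none of these details are stated on the slide, and the function name `random_cuboid` is hypothetical.

```python
import random
from collections import defaultdict


def random_cuboid(records, num_dims, rng=random):
    """Pick a random nonempty subset of dimensions and group by it.

    Assumptions: each record is (dim_0, ..., dim_{num_dims-1}, measure)
    and the aggregate is a sum.
    """
    k = rng.randint(1, num_dims)
    dims = sorted(rng.sample(range(num_dims), k))
    cuboid = defaultdict(float)
    for *key, measure in records:
        cuboid[tuple(key[j] for j in dims)] += measure
    return dims, dict(cuboid)


records = [("a", "x", 1.0), ("a", "y", 2.0), ("b", "x", 3.0)]
dims, cuboid = random_cuboid(records, 2, random.Random(0))
print(dims, cuboid)
```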


Compression Relative to What?

What default normalization do we compare against?

Data sets were obtained “relationally”: as lists of records, which we scan sequentially.

Default normalization: code 0 goes to the attribute value used in the first record, code 1 goes to the next-seen attribute value, etc.

“First seen, first numbered”.

Unused alternatives: sorted-as-strings, random, . . .
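A minimal sketch of the “first seen, first numbered” rule, assuming records are tuples of attribute values with one field per dimension (the function name `default_normalization` and the sample values are hypothetical):

```python
def default_normalization(records, num_dims):
    """Assign integer codes per dimension in order of first appearance."""
    codes = [{} for _ in range(num_dims)]
    for record in records:  # sequential scan of the relational data
        for j in range(num_dims):
            value = record[j]
            if value not in codes[j]:
                # The next unused code is the number of values seen so far.
                codes[j][value] = len(codes[j])
    return codes


records = [("Montreal", "Jan"), ("Toronto", "Jan"), ("Montreal", "Feb")]
codes = default_normalization(records, 2)
print(codes)
```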


Index of Extra Slides

✔ More IS details

✔ BBT

✔ NP-completeness of 1×3

✔ Iterated Matching is suboptimal

✔ Why cuboids from the datacube

✔ Default normalization
