ITCS 6163
Cube Computation
Two Problems
• Which cuboids should be materialized? – Ullman et al., SIGMOD '96 paper
• How do we efficiently compute the cube? – Agrawal et al., VLDB '96 paper
Implementing Data Cubes Efficiently
DSS queries must be answered fast!
1 min OK…
seconds great!
> 1 min NOT ACCEPTABLE!
One solution: Materialize frequently asked queries (or supersets)
Picking the right set is difficult (O(2^n))
What to materialize
• Nothing: pure ROLAP
• Everything: too costly
• Only part: key idea: many cells are computable from other cells.
ABC
AB  AC  BC
(AB, AC, and BC can all be computed from ABC)
Dependent and independent cells
Example: TPC-D benchmark (supplier, part, customer, sales). Q (find all sales for each part) is dependent on Q' (find all sales for each part and each supplier).
(p, all, all) ⪯ (p, s, all)   (dependency)
(p, all, all) = Σ_{i,j} (p, s_i, c_j)
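As a toy illustration of this roll-up (the cell values below are made up), summing the finer (p, s_i, c_j) cells produces the coarser (p, all, all) cells:

# (part, supplier, customer) -> sales; illustrative values only
psc = {('p1', 's1', 'c1'): 10.0, ('p1', 's2', 'c1'): 4.0, ('p2', 's1', 'c2'): 7.0}

def roll_up_to_part(cells):
    totals = {}
    for (p, s, c), sales in cells.items():   # aggregate away supplier and customer
        totals[p] = totals.get(p, 0.0) + sales
    return totals

print(roll_up_to_part(psc))                  # {'p1': 14.0, 'p2': 7.0}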
Example
1. PSC  6 million cells
2. PC   6 million cells
3. PS   0.8 million cells
4. SC   6 million cells
5. P    0.2 million cells
6. S    0.05 million cells
7. C    0.1 million cells
8. ALL  1 cell
PSC
PC  PS  SC
P  S  C
ALL
(Cube lattice)
Example (2)
We want to answer (p, all, all) (sales grouped by part)
a) If we have cuboid 5 (P) materialized, just read off the answer.
b) If we only have cuboid 2 (PC), compute the answer from it: scan 6 million cells; do they fit in RAM?
Cost of a) 0.2 M
Cost of b) 6 M
Decisions, decisions...
How many views must we materialize to get good performance?
Given space S (on disk), which views do we materialize?
In the previous example we’d need space for 19 Million cells.
Can we do better?
Avoid going to the raw (fact table) data: PSC (6 M)
PC (6M) can be answered using PSC (6M): no advantage in materializing it
SC (6M) can be answered using PSC (6M): no advantage in materializing it
Example again
1. PSC  6M     materialized: 6M
2. PC   6M     not materialized (answer from PSC)
3. PS   0.8M   materialized: 0.8M
4. SC   6M     not materialized (answer from PSC)
5. P    0.2M   materialized: 0.2M
6. S    0.01M  materialized: 0.01M
7. C    0.1M   materialized: 0.1M
Total materialized: 7.11M cells vs. 19M cells
(about the same performance)
Formal treatment
Q1 ⪯ Q2 (dependency: Q1 can be answered using the result of Q2)
Q(P) ⪯ Q(PC) ⪯ Q(PSC)   (lattice)
Add hierarchies
Customers: C (customers) → N (nation-wide cust., e.g., USA, Japan) → ALL (all cust.)
Suppliers: S (suppliers) → SN (nation-wide) → DF (domestic-foreign) → ALL
Parts: P (parts) → Sz (size), Ty (type) → ALL
Formal treatment (2)
(Lattice combining the customer and part hierarchies, with view sizes: CP 6M, NP 5M, CSz 5M, CTy 5.99M, NSz 1,250, NTy 3,750, C 0.1M, P 0.2M, N 25, Sz 50, Ty 150, ALL 1)
Formal treatment (3)
Cost model:
Cost(Q) = number of cells in Q_a, where Q_a is the cheapest materialized view such that Q ⪯ Q_a
With indexes we can make this better!
How do we know the size of the views?
Sampling
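A rough sketch of size estimation by sampling (assumption: a crude estimator that just counts the distinct group-by values seen in a uniform sample; it tends to underestimate, and real systems use better distinct-value estimators):

import random

def estimate_view_size(fact_rows, group_by, sample_fraction=0.01):
    sample = [r for r in fact_rows if random.random() < sample_fraction]
    distinct = {tuple(r[a] for a in group_by) for r in sample}
    return len(distinct)        # a lower bound on the number of cells in the view

# hypothetical usage: estimate_view_size(fact_table, ('part', 'supplier'))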
Optimizing Data Cube lattices
First problem (no space restrictions)
VERY HARD problem (NP-complete)
Heuristics: always include the "core" (base) cuboid. At every step you have already materialized a set Sv of views. Compute the benefit of a candidate view v relative to Sv as follows:
For each w ⪯ v define Bw:
    let u be the view of least cost in Sv such that w ⪯ u;
    if Cost(v) < Cost(u), then Bw = Cost(v) - Cost(u) (negative); else Bw = 0.
Define B(v, Sv) = -Σ_{w ⪯ v} Bw
Greedy algorithm
Sv = {core view}
for i = 1 to k begin
    select v not in Sv such that B(v, Sv) is maximum;
    Sv = Sv ∪ {v}
end
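A minimal Python sketch of this greedy selection (illustrative only; below[v] holds the set of views that can be answered from v, including v itself, and the data encodes the example lattice worked through on the following slides, with the dependency sets inferred from that worked example):

# Greedy selection of k views to materialize, in the style sketched above.
def benefit(v, selected, cost, below):
    """B(v, Sv): total saving over all views w <= v if v were materialized."""
    total = 0
    for w in below[v]:
        # cheapest already-materialized view that can answer w (the core covers everything)
        u_cost = min(cost[u] for u in selected if w in below[u])
        if cost[v] < u_cost:
            total += u_cost - cost[v]
    return total

def greedy(k, core, cost, below):
    selected = {core}                       # always materialize the core cuboid
    for _ in range(k):
        v = max((x for x in cost if x not in selected),
                key=lambda x: benefit(x, selected, cost, below))
        selected.add(v)
    return selected

# Example lattice from the following slides (a is the core view):
cost = {'a': 100, 'b': 50, 'c': 75, 'd': 20, 'e': 30, 'f': 40, 'g': 1, 'h': 10}
below = {'a': set('abcdefgh'), 'b': set('bdegh'), 'c': set('cefgh'),
         'd': set('dg'), 'e': set('egh'), 'f': set('fh'), 'g': {'g'}, 'h': {'h'}}
print(greedy(3, 'a', cost, below))          # selects b, then f, then d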
A simple example

Lattice (view : cost): a (100) at the top; below it b (50) and c (75); then d (20), e (30), f (40); g (1) and h (10) at the bottom.

First round, Sv = {a}:
v = b: Bb: C(v)=50, C(a)=100, -50; Bc: not (c ⪯ b), 0; Bd: -50; Be: -50; Bf: not (f ⪯ b), 0; Bg: -50; Bh: -50
v = c: Bc: C(v)=75, C(a)=100, -25; Be: -25; Bf: -25; Bg: -25; Bh: -25; Bb, Bd: not dependent, 0
v = d: Bd: C(v)=20, C(a)=100, -80; Bg: -80; all others not dependent, 0
v = e: Be: C(v)=30, C(a)=100, -70; Bg: -70; Bh: -70; all others 0
v = f: Bf: C(v)=40, C(a)=100, -60; Bh: -60; all others 0
v = g: Bg: C(v)=1, C(a)=100, -99; all others 0
v = h: Bh: C(v)=10, C(a)=100, -90; all others 0
Benefits: B(b,Sv) = 250, B(c,Sv) = 125, B(d,Sv) = 160, B(e,Sv) = 210, B(f,Sv) = 120, B(g,Sv) = 99, B(h,Sv) = 90. Pick b.

Second round, Sv = {a,b}:
v = c: Bc: C(v)=75, C(a)=100, -25; Bf: -25; Be, Bg, Bh: C(b)=50 is already cheaper, 0; Bd: not dependent, 0
v = d: Bd: C(v)=20, C(b)=50, -30; Bg: -30; all others 0
v = e: Be: C(v)=30, C(b)=50, -20; Bg: -20; Bh: -20; all others 0
v = f: Bf: C(v)=40, C(a)=100, -60; Bh: C(v)=40, C(b)=50, -10; all others 0
v = g: Bg: C(v)=1, C(b)=50, -49; all others 0
v = h: Bh: C(v)=10, C(b)=50, -40; all others 0
Benefits: B(c,Sv) = 50, B(d,Sv) = 60, B(e,Sv) = 60, B(f,Sv) = 70, B(g,Sv) = 49, B(h,Sv) = 40. Pick f.

Third round, Sv = {a,b,f}:
v = c: Bc: C(v)=75, C(a)=100, -25; Be, Bg, Bh: cheaper views already cover them, 0; Bd: not dependent, 0
v = d: Bd: C(v)=20, C(b)=50, -30; Bg: -30; all others 0
v = e: Be: C(v)=30, C(b)=50, -20; Bg: -20; Bh: C(v)=30, C(f)=40, -10; all others 0
v = g: Bg: C(v)=1, C(b)=50, -49; all others 0
v = h: Bh: C(v)=10, C(f)=40, -30; all others 0
Benefits: B(c,Sv) = 25, B(d,Sv) = 60, B(e,Sv) = 50, B(g,Sv) = 49, B(h,Sv) = 30. Pick d.

Sv = {a,b,d,f}
MOLAP example
Hyperion’s Essbase:
www.hyperion.com to download white paper and product demo.
• Builds a special secondary-memory data structure to store the cells of the core cuboid.
• Assumes that data is sparse and clustered along some dimension combinations
– Chooses dense dimension combinations.
– The rest are sparse combinations.
Structures
Two levels:
• Blocks in the first level correspond to the dense dimension combinations. The basic block has a size proportional to the product of the cardinalities of these dimensions. Each entry in the block points to a second-level block.
• Blocks in the second level correspond to the sparse dimensions. They are arrays of pointers, as many as the product of the cardinalities for sparse dimensions. Each pointer has one of three values: null (non-existent data), impossible (non-allowed combination) or a pointer to an actual data block.
Data Example
Dimensions
Departments (Sales,Mkt)
Time
Geographical information
Product
Distribution channels
Departments will generally have data for each Time period (so these two form the dense dimension combination).
Geographical information, Product, and Distribution channels, on the other hand, are typically sparse (e.g., most cities have only one Distribution channel and only some Product values).
Structures revisited
(Figure: the first-level dense block has one entry per Department × Time combination: S,1Q S,2Q S,3Q S,4Q M,1Q M,2Q M,3Q M,4Q; each entry points to a second-level sparse block over Geo., Product, Dist., whose non-null pointers lead to data blocks)
Allocating memory
Define member structure (e.g., dimensions)
Select dense dimension combinations and create upper level structure
Create lower level structure.
Input a data cell: if the pointer to its data block is empty, create a new data block;
otherwise insert the value into the existing data block (a minimal sketch of this structure follows below).
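A minimal sketch of the two-level structure just described (illustrative only, not Essbase's actual implementation): a dense upper level with one entry per dense dimension combination, each entry holding a sparse block keyed by the sparse dimension combination; missing keys play the role of null pointers.

from itertools import product

class TwoLevelCube:
    def __init__(self, dense_dims, sparse_dims):
        self.dense_dims = dense_dims            # e.g., {'dept': [...], 'time': [...]}
        self.sparse_dims = sparse_dims          # e.g., ['geo', 'product', 'channel']
        # one upper-level entry per dense combination; each starts out empty (null)
        self.blocks = {combo: None for combo in product(*dense_dims.values())}

    def insert(self, dense_key, sparse_key, value):
        if self.blocks[dense_key] is None:      # create the data block lazily
            self.blocks[dense_key] = {}
        self.blocks[dense_key][sparse_key] = value

    def lookup(self, dense_key, sparse_key):
        block = self.blocks[dense_key]
        return None if block is None else block.get(sparse_key)

# hypothetical usage with the example dimensions
cube = TwoLevelCube({'dept': ['Sales', 'Mkt'], 'time': ['1Q', '2Q', '3Q', '4Q']},
                    ['geo', 'product', 'channel'])
cube.insert(('Sales', '1Q'), ('USA', 'widget', 'retail'), 1200.0)
print(cube.lookup(('Sales', '1Q'), ('USA', 'widget', 'retail')))   # 1200.0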
Problem 2: COMPUTING DATACUBES
Four algorithms
• PIPESORT
• PIPEHASH
• SORT-OVERLAP
• Partitioned-cube
Optimizations
• Smallest-parent
  – AB can be computed from ABC, ABD, or ABCD. Which one should we use?
• Cache-results
  – Having computed ABC, we compute AB from it while ABC is still in memory.
• Amortize-scans
  – We may try to compute ABC, ACD, ABD, BCD in one scan of ABCD.
• Share-sorts
• Share-partitions
PIPESORT
Input: cube lattice and cost matrix. Each edge e_ij in the lattice is annotated with two costs:
    A(i,j): cost of computing j from i when i is sorted
    S(i,j): cost of computing j from i when i is not sorted
Output: a subgraph of the lattice in which each cuboid (group-by) is connected to a single parent from which it will be computed, and is associated with an attribute order in which it will be sorted. If that order is a prefix of the order of its parent, then the child can be computed without sorting the parent (cost A); otherwise the parent has to be sorted first (cost S). For every parent there will be only one out-edge labeled A.
PIPESORT (2)
Algorithm: proceeds in levels, k = 0, …, N-1 (N = number of dimensions). For each level, it finds the best way of computing level k from level k+1 by reducing the problem to a weighted bipartite matching problem:
Make k additional copies of each level-(k+1) group-by (each node then has k+1 vertices) and connect them to the same children as the original.
From the original copy the edges carry the A costs, while the edges from the copies carry the S costs.
Find the minimum-cost matching in the bipartite graph (each vertex in level k+1 is matched with at most one vertex in level k).
Example
(Figure: the level-2 group-bys with their A/S edge costs to the level-1 group-bys A, B, C: AB costs 2 (sorted) / 10 (unsorted), AC costs 5 / 12, BC costs 13 / 20)
Transformed lattice
(Figure: each group-by duplicated with its two costs: AB(2), AB(10), AC(5), AC(12), BC(13), BC(20), all connected to A, B, C)
Explanation of edges
AB(10): this means that we really have BA (we need to sort it to get A).
AB(2): this means that we have AB (no need to sort to get A).
PIPESORT pseudo-algorithm
Pipesort (input: lattice with A() and S() edge costs):
For level k = 0 to N-1:
    Generate_plan(k+1);
    For each cuboid g in level k+1:
        Fix the sort order of g as the order of the level-k cuboid connected to g by an A edge;
Generate_plan
Generate_plan(k+1):
    Make k additional copies of each level-(k+1) cuboid;
    Connect each copy to the same set of level-k vertices as the original;
    Assign costs A to the original edges and costs S to the copies' edges;
    Find the minimum-cost matching on the transformed bipartite graph;
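A rough sketch of Generate_plan's matching step (illustrative only: it simplifies the per-edge A(i,j)/S(i,j) costs to per-parent costs, as in the AB/AC/BC example below, and delegates the minimum-cost matching to SciPy's assignment solver):

import numpy as np
from scipy.optimize import linear_sum_assignment

def generate_plan(parents, children, A, S):
    """parents, children: lists of cuboids given as frozensets of attributes."""
    k = len(children[0])                         # level-k cuboids have k attributes
    rows = [(p, copy) for p in parents for copy in range(k + 1)]   # k+1 copies per parent
    BIG = 10**9                                  # effectively forbids invalid pairings
    cost = np.full((len(rows), len(children)), BIG, dtype=np.int64)
    for i, (p, copy) in enumerate(rows):
        for j, child in enumerate(children):
            if child < p:                        # child is computable from parent p
                cost[i, j] = A[p] if copy == 0 else S[p]
    row_ind, col_ind = linear_sum_assignment(cost)
    # for every child: its chosen parent, and whether the sorted (A) edge is used
    return {children[j]: (rows[i][0], rows[i][1] == 0) for i, j in zip(row_ind, col_ind)}

# hypothetical costs taken from the example slides (A cost / S cost per parent)
parents = [frozenset('AB'), frozenset('AC'), frozenset('BC')]
children = [frozenset('A'), frozenset('B'), frozenset('C')]
A = {parents[0]: 2, parents[1]: 5, parents[2]: 13}
S = {parents[0]: 10, parents[1]: 12, parents[2]: 20}
print(generate_plan(parents, children, A, S))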
Example
ABCD
ABC ABD ACD BCD
AB AC AD BC BD CD
A B C D
ALL
CBAD
CBA BAD ACD DBC
BA AC AD CB DB CD
A B C D
ALL
CBAD (BADC) (ACDB) (DBCA)
CBA BAD (DBA) ACD (ADC) (CDA) DBC
BA AC AD CB DB CD
A B C D
ALL
PipeHash
Input: lattice
PipeHash chooses for each vertex the parent with the smallest estimated size. The outcome is a minimum spanning tree (MST), where each vertex is a cuboid and an edge from i to j shows that i is the smallest parent of j.
Available memory is usually not enough to compute all the cuboids in the MST together, so we need to decide which cuboids can be computed together (a sub-MST), when to allocate and deallocate memory for the different hash tables, and which attribute to use for partitioning the data.
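A small sketch of this smallest-parent MST construction (the attribute-set representation of cuboids and the size estimates are assumptions for illustration):

def build_mst(cuboids, size):
    """Attach every cuboid to its smallest parent one level up in the lattice."""
    parent = {}
    for c in cuboids:
        candidates = [p for p in cuboids if c < p and len(p) == len(c) + 1]
        if candidates:                           # the base cuboid has no parent
            parent[c] = min(candidates, key=lambda p: size[p])
    return parent

# estimated sizes taken from the earlier PSC example
sizes = {frozenset('PSC'): 6_000_000, frozenset('PC'): 6_000_000,
         frozenset('PS'): 800_000, frozenset('SC'): 6_000_000, frozenset('P'): 200_000}
print(build_mst(set(sizes), sizes)[frozenset('P')])   # frozenset({'P', 'S'}), the smallest parent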
PipeHash
Input: lattice and estimated sizes of cuboids
Initialize the worklist with the MST of the search lattice;
While the worklist is not empty:
    Pick a tree T from the worklist;
    T' = Select-subtree(T) to be executed next;
    Compute-subtree(T');
Select-subtree
Select-subtree(T)
If the memory required by T is less than what is available, return T;
Else, let S be the set of attributes in root(T):
    For any s ⊆ S we get a subtree Ts of T, also rooted at root(T), including all cuboids that contain s;
    Ps = maximum number of partitions of root(T) possible if partitioned on s;
    Choose s such that mem(Ts)/Ps < available memory and Ts is the largest over all subsets of S;
    Remove Ts from T (put T - Ts back in the worklist).
Compute-subtree
Compute-subtree(T'):
numP = mem(T') * f / mem-available;
Partition the root of T' into numP partitions;
For each partition of root(T'):
    For each node n in T':
        Compute all children of n in one scan;
        If n was cached, save it to disk and release the memory occupied by its hash table;
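A minimal sketch of the "compute all children of n in one scan" step, with one hash table per child group-by (the SUM measure and the names are illustrative assumptions):

from collections import defaultdict

def compute_children_one_scan(parent_rows, children_attrs):
    """parent_rows: iterable of (dims_dict, measure); children_attrs: attribute tuples."""
    tables = {child: defaultdict(float) for child in children_attrs}
    for dims, measure in parent_rows:            # a single scan of the parent
        for child in children_attrs:
            key = tuple(dims[a] for a in child)
            tables[child][key] += measure        # aggregate into the child's hash table
    return tables

# hypothetical usage: compute AB, AC, BC from ABC in one scan
abc = [({'A': 'a1', 'B': 'b1', 'C': 'c1'}, 5.0),
       ({'A': 'a1', 'B': 'b1', 'C': 'c2'}, 3.0),
       ({'A': 'a2', 'B': 'b1', 'C': 'c1'}, 1.0)]
out = compute_children_one_scan(abc, [('A', 'B'), ('A', 'C'), ('B', 'C')])
print(dict(out[('A', 'B')]))                     # {('a1', 'b1'): 8.0, ('a2', 'b1'): 1.0}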
Example
ABCD
ABC ABD ACD BCD
AB AC BC AD CD BD
A B C D
ALL
ABCD
ABC ABD ACD
AB AC AD
A
ABCD
ABC BCD
AB BC CD BD
A B C D
ALL
(Figure sequence: the sub-MST rooted at ABCD is computed partition by partition on A; in successive snapshots ABC, ABD, ACD, then AB, AC, AD, then A are filled from the current partition, and completed hash tables are written to disk so their memory can be released)
Remaining Subtrees
all
A B C D
AB BC CD BD
ABC BCD
ABCD
OVERLAP
Sorted-Runs:
Consider a cuboid on j attributes {A1, A2, …, Aj}; we use B = (A1, A2, …, Aj) to denote the cuboid sorted in that attribute order.
Consider S = (A1, A2, …, Al-1, Al+1, …, Aj), computed from B. A sorted run R of S in B is defined as:
R = πS(Q), where Q is a maximal sequence of tuples of B such that, for each tuple in Q, the first l columns have the same value.
Sorted-run
B = [(a,1,2),(a,1,3),(a,2,2),(b,1,3),(b,3,2),(c,3,1)]
S = first and third attribute
S = [(a,2),(a,3),(b,3),(b,2),(c,1)]
Sorted runs: [(a,2),(a,3)] [(a,2)] [(b,3)] [(b,2)] [(c,1)]
Partitions
B and S have a common prefix (A1… Al-1)
A partition of the cuboid S in B is the union of sorted runs such that the first l-1 columns of all the tuples of the sorted runs have the same values.
[(a,2),(a,3)] [(b,2),(b,3)] [(c,1)]
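A small sketch that reproduces the sorted runs and partitions of this example (itertools.groupby yields the maximal runs because B is already sorted):

from itertools import groupby

B = [('a', 1, 2), ('a', 1, 3), ('a', 2, 2), ('b', 1, 3), ('b', 3, 2), ('c', 3, 1)]
proj = lambda t: (t[0], t[2])   # S keeps the first and third attributes
l = 2                           # the dropped attribute is the second one

# sorted runs: project each maximal run of B that agrees on the first l columns
runs = [sorted({proj(t) for t in grp}) for _, grp in groupby(B, key=lambda t: t[:l])]
print(runs)        # [[('a', 2), ('a', 3)], [('a', 2)], [('b', 3)], [('b', 2)], [('c', 1)]]

# partitions: union of the sorted runs that agree on the first l-1 columns
partitions = [sorted({x for r in grp for x in r})
              for _, grp in groupby(runs, key=lambda r: r[0][:l - 1])]
print(partitions)  # [[('a', 2), ('a', 3)], [('b', 2), ('b', 3)], [('c', 1)]]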
OVERLAP
Sort the base cuboid: this forces the sorted order in which the other cuboids are computed
ABCD
ABC ABD ACD BCD
AB AC BC AD CD BD
A B C D
ALL
OVERLAP (2)
If there is enough memory to hold all the cuboids, compute them all (very seldom true). Otherwise, use the partition as the unit of computation: we just need sufficient memory to hold one partition. As soon as a partition is computed, its tuples can be pipelined to compute descendant cuboids (same partition) and then written to disk. The memory is then reused to compute the next partition.
Example XYZ → XZ, with XYZ = [(a,1,2),(a,1,3),(a,2,2),(b,1,3),(b,3,2),(c,3,1)].
Partitions of XZ in XYZ: [(a,2),(a,3)] [(b,2),(b,3)] [(c,1)]
Compute the (a,1,2),(a,1,3),(a,2,2) cells of XYZ and use them to compute (a,2),(a,3) in XZ; then write all these cells to disk.
Compute the (b,1,3),(b,3,2) cells of XYZ and use them to compute (b,2),(b,3) in XZ; then write these cells to disk.
Compute the (c,3,1) cell of XYZ and use it to compute (c,1) in XZ; then write these cells to disk.
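A minimal sketch of this partition-by-partition pipelining for XYZ → XZ (COUNT-style cells for brevity; write_to_disk is an illustrative stand-in for the actual disk write):

from itertools import groupby
from collections import Counter

XYZ = [('a', 1, 2), ('a', 1, 3), ('a', 2, 2), ('b', 1, 3), ('b', 3, 2), ('c', 3, 1)]

def write_to_disk(name, cells):
    print(f"write {name}: {sorted(cells)}")

# XZ's partitions are delimited by X, the common prefix of XYZ and XZ
for x_value, part in groupby(XYZ, key=lambda t: t[0]):
    part = list(part)
    xz_partition = Counter((t[0], t[2]) for t in part)   # aggregate within the partition
    write_to_disk('XYZ cells', part)                      # parent cells can be flushed
    write_to_disk('XZ partition', xz_partition.items())   # then the child's partition
    # the memory used by this partition is now reused for the next one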
OVERLAP (3)
Choose a parent from which to compute each cuboid: a DAG. Goal: minimize the size of the partitions of a cuboid, so that less memory is needed. E.g., it is better to compute AC from ACD than from ABC (since the sort order matches and the partition size is 1). This is a hard problem.
Heuristic: maximize the size of the common prefix.
ABCD
ABC ABD ACD BCD
AB AC BC AD CD BD
A B C D
ALL
OVERLAP (4)
Choose a set of cuboids for overlapped computation, subject to the memory constraints. To compute a cuboid in memory, we need memory equal to the size of its partition. Partition sizes can be estimated from cuboid sizes by assuming some distribution (uniform?) of the data. If this much memory can be spared, the cuboid is marked as being in Partition state. Each of the other cuboids is allocated a single page (for temporary results); these cuboids are in SortRun state. A cuboid in Partition state can have its tuples pipelined for the computation of its descendants.
A cuboid can be considered for computation if it is the root or if its parent is marked as being in Partition state. The total memory allocated to all cuboids cannot exceed the available memory.
OVERLAP (5)
Again, a hard problem… Heuristic: traverse the tree in BFS manner.
ABCD
ABC(1) ABD(1) ACD(1) BCD(50)
AB(1) AC(1) BC(1) AD(5) CD(40) BD(1)
A(1) B(1) C(1) D(5)
ALL
OVERLAP (6)
Computing a cuboid from its parent:
Output: the sorted cuboid S
foreach tuple of B do
    if (state == Partition) then process_partition();
    else process_sorted_run();
OVERLAP (7)
Process_partition:
If the input tuple starts a new partition, output the current partition at the end of the cuboid and start a new one.
If the input tuple matches an existing tuple in the partition, update the aggregate.
Else, insert the input tuple into the partition.

Process_sorted_run:
If the input tuple starts a new sorted run, flush all the pages of the current sorted run and start a new one.
If the input tuple matches the last tuple in the sorted run, recompute the aggregate.
Else, append the tuple to the end of the existing run.
Observations
In ABCD → ABC, the partition size is 1. Why?
In ABCD → ABD, the partition size is equal to the number of distinct C values. Why?
In ABCD → BCD, the partition size is the size of the cuboid BCD. Why?
Running example with 25 pages
ABCD(1)
ABC(1) ABD(1) ACD(1) BCD(1)
AB(1) AC(1) BC(1) AD(5) CD(40) BD(1)
A(1) B(1) C(1) D(5)
ALL
1 page
BC(1) CD(1) BD(1)
B(1)
BCD(10)
CD(10)
C(1) D(5)
ALL(1)
Other issues
• Iceberg cube
  – contains only the aggregates above a certain threshold
  – Jiawei Han's SIGMOD '01 paper