Transcript
Page 1: qCube: Efficient integration of range query operators over a high dimension data cube

qCube: Efficient integration of range query operators over a high dimension data cube

Rodrigo Rocha Silva, Doctorate Student

Prof. Dr. Celso Massaki Hirata, Advisor

Prof. Dr. Joubert de Castro Lima, Co-Advisor

ITA – INSTITUTO TECNOLÓGICO DE AERONÁUTICA

Electronic Engineering and Computer Science Division - EEC/I

Department of Computer Science

Brazil

Page 2: qCube: Efficient integration of range query operators over a high dimension data cube

Wednesday, October 02, 2012


28º Simpósio Brasileiro de Banco de Dados - Rodrigo Rocha Silva

Goal

Present a new cube approach designed for high dimension range queries. The approach, named Query Cube (qCube), implements the Equal, Not Equal, Greater or Less than, Some, Between and Similar range query operators and the Distinct, Sub-cube and Top-k Similar inquire query operators.

Page 3: qCube: Efficient integration of range query operators over a high dimension data cube


Topics

– Motivation

– Data Cube

– Related Work

– Query Cube (qCube)

– Experiments

– Results

– Conclusions

Page 4: qCube: Efficient integration of range query operators over a high dimension data cube


Motivation

Users need to view data in a tangible way, such as reports, cross tables and histograms.

Page 5: qCube: Efficient integration of range query operators over a high dimension data cube


Motivation

• Suppose that a decision-making process requires the following information:

“What is the women journal research papers variance impact, using months {1, 3, 5, 7, 11}, year 2012 and ages varying from 25-40 years? Return results for all countries”

“The average temperatures above 30 degrees Celsius on the weekends of leap years in the last 200 years.”

Page 6: qCube: Efficient integration of range query operators over a high dimension data cube


Data Cube

A data cube, introduced by Gray et al. (1996), is a generalization of the group-by operator over all possible combinations of dimensions, with aggregates at various granularities.

Page 7: qCube: Efficient integration of range query operators over a high dimension data cube


Data Cube

A data cube has exponential complexity with respect to the number of dimensions.

For an input with d dimensions, the output has 2^d cuboids.
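The exponential growth can be checked by enumerating the group-bys directly. The sketch below is only an illustration, assuming each cuboid corresponds to one subset of the dimension set; the helper name `cuboids` is hypothetical and not part of the slides.

```python
from itertools import combinations


def cuboids(dimensions):
    """Return every group-by (cuboid) over the dimensions, apex included."""
    return [
        combo
        for k in range(len(dimensions) + 1)
        for combo in combinations(dimensions, k)
    ]


# Three dimensions give 2^3 = 8 cuboids: (), A, B, C, AB, AC, BC, ABC.
all_cuboids = cuboids(["A", "B", "C"])
print(len(all_cuboids))
```

Adding one dimension doubles the number of cuboids, which is why full materialization becomes impractical for high dimension cubes.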

Page 8: qCube: Efficient integration of range query operators over a high dimension data cube


Data Cube

• Hierarchies

Discipline

Department

Year

Year

Day

Hour

Page 9: qCube: Efficient integration of range query operators over a high dimension data cube


Data Cube

Base Relation R – 11 tuples

A    B    C    COUNT
a1   b1   c1   1
a3   b3   c2   1
a2   b3   c2   1
a3   b1   c1   1
a2   b1   c1   1
a2   b2   c2   1
a1   b1   c2   1
a2   b2   c1   1
a3   b1   c2   1
a1   b3   c2   1
a2   b1   c2   1

FULL 3D CUBE – 38 tuples (the 27 aggregated cells below + the 11 base tuples)

A    B    C    COUNT
*    *    *    11
a1   *    *    3
a2   *    *    5
a3   *    *    3
*    b1   *    6
*    b2   *    2
*    b3   *    3
*    *    c1   4
*    *    c2   7
a1   b1   *    2
a1   b3   *    1
a2   b1   *    2
a2   b2   *    2
a2   b3   *    1
a3   b1   *    2
a3   b3   *    1
a1   *    c1   1
a1   *    c2   2
a2   *    c1   2
a2   *    c2   3
a3   *    c1   1
a3   *    c2   2
*    b1   c1   3
*    b1   c2   3
*    b2   c1   1
*    b2   c2   1
*    b3   c2   3
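The 38-tuple full cube can be reproduced with a short Python sketch. It is a naive illustration, not the cubing algorithm discussed in the talk: every base tuple is counted in each of the 2^3 cells it belongs to, with `*` marking an aggregated dimension. The function name `full_cube` is an assumption made for this example.

```python
from collections import Counter
from itertools import product

# Base relation R from the slide (11 tuples over dimensions A, B, C).
R = [
    ("a1", "b1", "c1"), ("a3", "b3", "c2"), ("a2", "b3", "c2"),
    ("a3", "b1", "c1"), ("a2", "b1", "c1"), ("a2", "b2", "c2"),
    ("a1", "b1", "c2"), ("a2", "b2", "c1"), ("a3", "b1", "c2"),
    ("a1", "b3", "c2"), ("a2", "b1", "c2"),
]


def full_cube(rows):
    """COUNT for every cell of the full cube; '*' marks an aggregated dimension."""
    cube = Counter()
    for row in rows:
        # Each tuple contributes to 2^d cells: every way of keeping or
        # aggregating each of its d values.
        for mask in product((False, True), repeat=len(row)):
            cell = tuple(v if keep else "*" for v, keep in zip(row, mask))
            cube[cell] += 1
    return cube


cube = full_cube(R)
print(len(cube))               # number of distinct cells
print(cube[("*", "*", "*")])   # apex cell
```

With this input the cube holds 38 distinct cells, and the apex cell counts all 11 base tuples.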

Page 10: qCube: Efficient integration of range query operators over a high dimension data cube


Related Work – Frag-Cubing Approach

From book Han and Kamber: Data Mining Concepts and Techniques

• Partitions the data vertically

• Reduces a high-dimensional cube into a set of lower-dimensional cubes

• Lossless reduction

• Offers tradeoffs between the amount of pre-processing and the speed of online computation

Page 11: qCube: Efficient integration of range query operators over a high dimension data cube


Related Work – Frag-Cubing Example

• Let the cube aggregation function be count

• Divide the 5 dimensions into 2 shell fragments: (A, B, C) and (D, E)

tid  A    B    C    D    E
1    a1   b1   c1   d1   e1
2    a1   b2   c1   d2   e1
3    a1   b2   c1   d1   e2
4    a2   b1   c1   d1   e2
5    a2   b1   c1   d1   e3

From book Han and Kamber: Data Mining Concepts and Techniques

Page 12: qCube: Efficient integration of range query operators over a high dimension data cube


Related Work – Frag-Cubing 1-D Inverted Indices

• Build a traditional inverted index or RID list

Attribute Value   TID List        List Size
a1                1, 2, 3         3
a2                4, 5            2
b1                1, 4, 5         3
b2                2, 3            2
c1                1, 2, 3, 4, 5   5
d1                1, 3, 4, 5      4
d2                2               1
e1                1, 2            2
e2                3, 4            2
e3                5               1

From book Han and Kamber: Data Mining Concepts and Techniques
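The inverted index for this example can be built in a single pass over the base table. The following Python sketch is only an illustration; `build_inverted_index` is a hypothetical helper name, and it relies on each attribute value already encoding its dimension (a1, b1, ...), as in the slides.

```python
from collections import defaultdict

# Base table from the Frag-Cubing example: tid -> (A, B, C, D, E).
rows = {
    1: ("a1", "b1", "c1", "d1", "e1"),
    2: ("a1", "b2", "c1", "d2", "e1"),
    3: ("a1", "b2", "c1", "d1", "e2"),
    4: ("a2", "b1", "c1", "d1", "e2"),
    5: ("a2", "b1", "c1", "d1", "e3"),
}


def build_inverted_index(table):
    """Map each attribute value to the sorted list of TIDs that contain it."""
    index = defaultdict(list)
    for tid in sorted(table):
        for value in table[tid]:
            index[value].append(tid)
    return dict(index)


index = build_inverted_index(rows)
for value, tids in index.items():
    print(value, tids, len(tids))  # attribute value, TID list, list size
```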

Page 13: qCube: Efficient integration of range query operators over a high dimension data cube


Related Work – Frag-Cubing Approach

• Generalize the 1-D inverted indices to multi-dimensional ones in the data cube sense

• Compute all cuboids for data cubes ABC and DE while retaining the inverted indices

• For example, shell fragment cube ABC contains 7 cuboids:

– A, B, C

– AB, AC, BC

– ABC

• This completes the offline computation stage

Cell     Intersection          TID List   List Size
a1 b1    1 2 3 ∩ 1 4 5         1          1
a1 b2    1 2 3 ∩ 2 3           2 3        2
a2 b1    4 5 ∩ 1 4 5           4 5        2
a2 b2    4 5 ∩ 2 3             ⊗          0

From book Han and Kamber: Data Mining Concepts and Techniques
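The AB cuboid in this example is obtained by intersecting the 1-D TID lists of A and B. Here is a minimal Python sketch of that step, using sets for the TID lists; the function name `cuboid_2d` is illustrative.

```python
from itertools import product

# 1-D inverted indices for dimensions A and B (values from the slides).
index_a = {"a1": {1, 2, 3}, "a2": {4, 5}}
index_b = {"b1": {1, 4, 5}, "b2": {2, 3}}


def cuboid_2d(left, right):
    """Map each (left value, right value) cell to the intersection of its TID sets."""
    return {(lv, rv): left[lv] & right[rv] for lv, rv in product(left, right)}


cuboid_ab = cuboid_2d(index_a, index_b)
for cell, tids in cuboid_ab.items():
    print(cell, sorted(tids), len(tids))
```

The cell (a2, b2) ends up with an empty TID set, which corresponds to the empty cell marked ⊗ in the table.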

Page 14: qCube: Efficient integration of range query operators over a high dimension data cube


Related Work – Frag-Cubing Measure Table

• If measures other than count are present, store them in an ID_measure table, separate from the shell fragments

tid  count  sum
1    5      70
2    3      10
3    8      20
4    5      40
5    2      30

From book Han and Kamber: Data Mining Concepts and Techniques

Page 15: qCube: Efficient integration of range query operators over a high dimension data cube


Related Work – Frag-Cubing Query

• Given the fragment cubes, process a query as follows

1. Divide the query into fragments, matching the shell fragments

2. Fetch the corresponding TID list for each fragment from the fragment cube

3. Intersect the TID lists from each fragment to construct the instantiated base table

4. Compute the data cube using the base table with any cubing algorithm

From book Han and Kamber: Data Mining Concepts and Techniques
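Steps 1 to 3 can be sketched in Python with the example data from the previous slides. This is an illustration under stated assumptions, not the Frag-Cubing implementation: `fragment_tids` stands in for a lookup in a precomputed fragment cube by intersecting 1-D TID sets on the fly, and the names `FRAGMENTS`, `fragment_tids` and `point_query` are hypothetical. Step 4 (cubing the instantiated base table) is left out; the sketch only aggregates the ID_measure rows of the selected TIDs.

```python
# 1-D TID sets and ID_measure table from the example slides.
index = {
    "a1": {1, 2, 3}, "a2": {4, 5}, "b1": {1, 4, 5}, "b2": {2, 3},
    "c1": {1, 2, 3, 4, 5}, "d1": {1, 3, 4, 5}, "d2": {2},
    "e1": {1, 2}, "e2": {3, 4}, "e3": {5},
}
measures = {1: (5, 70), 2: (3, 10), 3: (8, 20), 4: (5, 40), 5: (2, 30)}  # tid -> (count, sum)
ALL_TIDS = set(measures)
FRAGMENTS = [(0, 1, 2), (3, 4)]  # query positions of fragments (A, B, C) and (D, E)


def fragment_tids(query, positions):
    """TID set for one fragment (assumed stand-in for a fragment cube lookup)."""
    tids = set(ALL_TIDS)
    for pos in positions:
        if query[pos] != "*":
            tids &= index[query[pos]]
    return tids


def point_query(query):
    """Split the query by fragment, intersect the TID sets, aggregate measures."""
    tids = set(ALL_TIDS)
    for positions in FRAGMENTS:
        tids &= fragment_tids(query, positions)
    total_count = sum(measures[t][0] for t in tids)
    total_sum = sum(measures[t][1] for t in tids)
    return tids, total_count, total_sum


result = point_query(("a1", "b2", "*", "d1", "*"))
print(result)
```

For the query (a1, b2, *, d1, *), fragment (A, B, C) yields TIDs {2, 3}, fragment (D, E) yields {1, 3, 4, 5}, and their intersection is {3}.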

Page 16: qCube: Efficient integration of range query operators over a high dimension data cube


Related Work – Frag-Cubing Approach

[Figure: dimensions A, B, C, D, E, F, G, H, I, J, K, L, M, N, …; Base Table; Online Computation]

From book Han and Kamber: Data Mining Concepts and Techniques

Page 17: qCube: Efficient integration of range query operators over a high dimension data cube


qCube Approach

qCube implements a set of tuple identifiers per dimension attribute, similar to Frag-Cubing.

Therefore, qCube can answer point queries using intersections of tuple identifier sets, and range queries using unions plus intersections, regardless of the measure function type.

Frag-Cubing implements only point queries and some inquire queries.

There is no Frag-Cubing solution for queries like

“What is the women journal research papers variance impact, using months {1, 3, 5, 7, 11}, year 2012 and ages varying from 25-40 years? Return results for all countries”

Page 18: qCube: Efficient integration of range query operators over a high dimension data cube


qCube Approach

Implements the range query operators:

• Equal;

• Not Equal;

• Greater or Less than;

• Some;

• Between;

• Similar.

Also implements the inquire query operators:

• Distinct;

• Sub-cube;

• Top-k Similar.

All of these operate over a high dimension data cube.

Page 19: qCube: Efficient integration of range query operators over a high dimension data cube


qCube Architecture

Page 20: qCube: Efficient integration of range query operators over a high dimension data cube


qCube Computation

TID  A    B    C    D    E
1    a1   b1   c1   d1   e1
2    a2   b2   c2   d2   e2
3    a1   b1   c1   d1   e1
4    a3   b3   c3   d2   e2
5    a1   b1   c4   d1   e2
6    a5   b5   c5   d2   e2

Attribute Value   TID List
a1                1, 3, 5
a2                2
a3                4
a5                6
b1                1, 3, 5
b2                2
b3                4
b5                6
c1                1, 3
c2                2
c3                4
c4                5
c5                6
d1                1, 3, 5
d2                2, 4, 6
e1                1, 3
e2                2, 4, 5, 6

Function   Variance   Count   Average   Skewness   Standard deviation
tid        M1         M2      M3        M4         M5
1          2.56       1       10        1          877686769698
2          3.14       1       20        0          7986676867.99
3          2.45       1       10        1          -7878789.8777
4          6.7        1       11        1          -99974333.23
5          9          1       3         1          100045.655
6          1          1       1         1          1

Page 21: qCube: Efficient integration of range query operators over a high dimension data cube


qCube Update

Updates use the same algorithm as qCube Computation.

Page 22: qCube: Efficient integration of range query operators over a high dimension data cube


qCube Update

TID  A    B    C    D    E
1    a1   b1   c1   d1   e1
2    a2   b2   c2   d2   e2
3    a1   b1   c1   d1   e1
4    a3   b3   c3   d2   e2
5    a1   b1   c4   d1   e2
6    a5   b5   c5   d2   e2

tid  A    B    C    D    E    F
5    a3   b1   c4   d1   e2
7    a3   b2   c3   d3   e3
8    a2   b3   c4   d3   e2   f1
9    a5   b5   c5   d1   e1   f2

Attribute Value   TID List
a1                1, 3
a2                2, 8
a3                4, 5, 7
a5                6, 9
b1                1, 3, 5
b2                2, 7
b3                4, 8
b5                6, 9
c1                1, 3
c2                2
c3                4, 7
c4                5, 8
c5                6, 9
d1                1, 3, 5, 9
d2                2, 4, 6
d3                7, 8
e1                1, 3, 9
e2                2, 4, 5, 6, 8
e3                7
f1                8
f2                9
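The update example can be traced with one routine that serves both the initial computation and the update, in the spirit of the previous slide. The Python sketch below is an assumption about how such a routine might look, not the qCube code; `apply_rows` is a hypothetical name. An existing TID (5) is first removed from its stale TID sets, and the new dimension F simply introduces new attribute values.

```python
from collections import defaultdict

base = {
    1: ("a1", "b1", "c1", "d1", "e1"),
    2: ("a2", "b2", "c2", "d2", "e2"),
    3: ("a1", "b1", "c1", "d1", "e1"),
    4: ("a3", "b3", "c3", "d2", "e2"),
    5: ("a1", "b1", "c4", "d1", "e2"),
    6: ("a5", "b5", "c5", "d2", "e2"),
}
changes = {
    5: ("a3", "b1", "c4", "d1", "e2"),        # existing tuple, A changes a1 -> a3
    7: ("a3", "b2", "c3", "d3", "e3"),
    8: ("a2", "b3", "c4", "d3", "e2", "f1"),  # brings the new dimension F
    9: ("a5", "b5", "c5", "d1", "e1", "f2"),
}


def apply_rows(index, table, rows):
    """Insert or overwrite rows, keeping the TID sets in step with the table."""
    for tid, row in rows.items():
        for value in table.get(tid, ()):  # drop stale entries of an updated tuple
            index[value].discard(tid)
        for value in row:
            index[value].add(tid)
        table[tid] = row


index, table = defaultdict(set), {}
apply_rows(index, table, base)     # initial computation
apply_rows(index, table, changes)  # update reuses the same routine

for value in sorted(index):
    print(value, sorted(index[value]))
```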

Page 23: qCube: Efficient integration of range query operators over a high dimension data cube


qCube Query

pQ = a1:*:*:*:e1

Attribute Value   TID List
a1                1, 3
a2                2, 8
a3                4, 5, 7
a5                6, 9
b1                1, 3, 5
b2                2, 7
b3                4, 8
b5                6, 9
c1                1, 3
c2                2
c3                4, 7
c4                5, 8
c5                6, 9
d1                1, 3, 5, 9
d2                2, 4, 6
d3                7, 8
e1                1, 3, 9
e2                2, 4, 5, 6, 8
e3                7
f1                8
f2                9
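The point query pQ = a1:*:*:*:e1 reduces to intersecting the TID sets of a1 and e1. The Python sketch below illustrates this, assuming the colon-separated query string shown on the slide and the TID sets of the updated relation (TID 7 contains a3). Intersecting the smallest sets first is a common heuristic chosen for this sketch, and `point_query` is a hypothetical name.

```python
index = {
    "a1": {1, 3}, "a2": {2, 8}, "a3": {4, 5, 7}, "a5": {6, 9},
    "b1": {1, 3, 5}, "b2": {2, 7}, "b3": {4, 8}, "b5": {6, 9},
    "c1": {1, 3}, "c2": {2}, "c3": {4, 7}, "c4": {5, 8}, "c5": {6, 9},
    "d1": {1, 3, 5, 9}, "d2": {2, 4, 6}, "d3": {7, 8},
    "e1": {1, 3, 9}, "e2": {2, 4, 5, 6, 8}, "e3": {7},
    "f1": {8}, "f2": {9},
}
ALL_TIDS = set(range(1, 10))


def point_query(pq, index, all_tids):
    """Intersect the TID sets of every instantiated value; '*' means no constraint."""
    tid_sets = [index.get(v, set()) for v in pq.split(":") if v != "*"]
    result = set(all_tids)
    for tids in sorted(tid_sets, key=len):  # smallest sets first
        result &= tids
    return result


q_r = point_query("a1:*:*:*:e1", index, ALL_TIDS)
print(sorted(q_r))
```

Here a1 contributes {1, 3} and e1 contributes {1, 3, 9}, so the query returns TIDs 1 and 3.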

Page 24: qCube: Efficient integration of range query operators over a high dimension data cube


qCube Range and Inquire Query

rOp = (greater than + less than + between + some + different + similar x (fv1 … fvn))

iOp = (sub-cube + distinct + top-k similar x (fv1 … fvn))

qCube rearranges the sub-queries of Q in order to improve query response times.

As a result of Q we have qR = (TID1, TID2, …, TIDk), where TIDi is the i-th tuple identifier of relation R.
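Each range operator can be read as a union of the TID sets of every attribute value that satisfies the operator, followed by intersections across sub-queries. The Python sketch below illustrates that reading on a hypothetical toy index (dimension -> attribute value -> TID set). The function names, the inclusive bounds of `between` and the toy data are assumptions, and the similar operator is not covered.

```python
def union_where(dim_index, predicate):
    """Union of the TID sets of every attribute value that satisfies predicate."""
    result = set()
    for value, tids in dim_index.items():
        if predicate(value):
            result |= tids
    return result


def greater_than(dim_index, bound):
    return union_where(dim_index, lambda v: v > bound)


def less_than(dim_index, bound):
    return union_where(dim_index, lambda v: v < bound)


def between(dim_index, low, high):  # inclusive bounds (assumption)
    return union_where(dim_index, lambda v: low <= v <= high)


def some(dim_index, values):
    wanted = set(values)
    return union_where(dim_index, lambda v: v in wanted)


def different(dim_index, value):
    return union_where(dim_index, lambda v: v != value)


# Hypothetical toy index, not taken from the slides.
index = {
    "month": {1: {1, 4}, 2: {2}, 3: {3, 5}, 7: {6}},
    "age": {22: {1}, 25: {2, 3}, 31: {4, 5}, 45: {6}},
}

months = some(index["month"], [1, 3, 5, 7, 11])
ages = between(index["age"], 25, 40)
q_r = sorted(months & ages)  # qR as a sorted list of TIDs
print(q_r)
```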

Page 25: qCube: Efficient integration of range query operators over a high dimension data cube


qCube Query - example

Page 26: qCube: Efficient integration of range query operators over a high dimension data cube


qCube Query - example

“What is the women journal research papers variance impact, using months {1, 3, 5, 7, 11}, year 2012 and ages varying from 25-40 years? Return results for all countries”

The point queries in Q are (sex = women, paperType = journal, year = 2012).

The range queries (month = (1, 3, 5, 7, 11), age between 25 and 40) are also sorted according to their cardinalities.

Q also contains one inquire query (country = distinct).
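The whole example query can be traced end to end on a small, invented relation. Everything in the Python sketch below is hypothetical: the rows, the `impact` measure values, the use of population variance (`statistics.pvariance`), and the choice to sort all sub-queries together by TID-set size before intersecting. It shows only how point, range and inquire parts could combine.

```python
from collections import defaultdict
from statistics import pvariance

DIMS = ("sex", "paperType", "year", "month", "age", "country")

# Hypothetical rows: tid -> (sex, paperType, year, month, age, country, impact).
rows = {
    1: ("women", "journal", 2012, 1, 30, "BR", 2.0),
    2: ("women", "journal", 2012, 3, 35, "BR", 4.0),
    3: ("women", "journal", 2012, 2, 28, "BR", 9.0),
    4: ("men", "journal", 2012, 5, 33, "US", 1.0),
    5: ("women", "journal", 2012, 7, 26, "US", 3.0),
    6: ("women", "journal", 2012, 11, 40, "US", 7.0),
    7: ("women", "conference", 2012, 1, 30, "US", 8.0),
    8: ("women", "journal", 2011, 1, 30, "BR", 7.0),
    9: ("women", "journal", 2012, 5, 50, "BR", 6.0),
}

# One TID set per attribute value, per dimension.
index = {dim: defaultdict(set) for dim in DIMS}
for tid, row in rows.items():
    for dim, value in zip(DIMS, row):
        index[dim][value].add(tid)


def union_of(dim, predicate):
    """Range sub-query: union of the TID sets of all qualifying values."""
    result = set()
    for value, tids in index[dim].items():
        if predicate(value):
            result |= tids
    return result


sub_queries = [
    index["sex"]["women"],                               # point
    index["paperType"]["journal"],                       # point
    index["year"][2012],                                 # point
    union_of("month", lambda m: m in {1, 3, 5, 7, 11}),  # some
    union_of("age", lambda a: 25 <= a <= 40),            # between
]
sub_queries.sort(key=len)  # smallest TID sets first
q_r = set.intersection(*sub_queries)

# Inquire part (country = distinct): one variance per distinct country.
by_country = defaultdict(list)
for tid in q_r:
    by_country[rows[tid][5]].append(rows[tid][6])
variance_by_country = {c: pvariance(v) for c, v in by_country.items()}
print(sorted(q_r), variance_by_country)
```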

Page 27: qCube: Efficient integration of range query operators over a high dimension data cube


qCube Query - example

Page 28: qCube: Efficient integration of range query operators over a high dimension data cube


Experiments

• We tested the qCube Computation and Query algorithms against the Frag-Cubing algorithm used in [Li et al. 2004];

• The qCube algorithms were coded in 64-bit Java;

• Frag-Cubing is a free and open source C++ application (http://illimine.cs.uiuc.edu/);

• The synthetic base relations were created using the data generator provided by the IlliMine project;

• The IlliMine project is an open-source project that provides various approaches for data mining and machine learning;

• The Frag-Cubing approach is part of the IlliMine project;

• We ran the algorithms on a machine with two six-core Intel Xeon processors at 2.4 GHz per core, 12 MB of cache and 128 GB of DDR3 1333 MHz RAM;

• The system runs 64-bit Windows Server 2008, High Performance version.

Page 29: qCube: Efficient integration of range query operators over a high dimension data cube


Results - Performance Evaluation of Point Queries and Skewed Relations

Response time per query over 100 trials: T=10^7; C=5000; D=30; S=0

Response time per query over 100 trials: T=10^7; C=5000; D=30; S=2.5

Page 30: qCube: Efficient integration of range query operators over a high dimension data cube


Results - Performance Evaluation of Range Query Operators and Skewed Relations

Response time of queries with one infrequent point operator: T=10^7; C=5000; D=30; S=2.5

Page 31: qCube: Efficient integration of range query operators over a high dimension data cube


Results - Performance Evaluation of Inquire Operators and Skewed Relations

Response time of queries with inquire operators: T=10^7; C=5000; D=30; S=2.5

Page 32: qCube: Efficient integration of range query operators over a high dimension data cube


Results - Runtime and Memory Consumption

Page 33: qCube: Efficient integration of range query operators over a high dimension data cube


Conclusions

• qCube has linear runtime and memory consumption, similar to Frag-Cubing;

• It implements the Not Equal, Greater or Less than, Some, Between and Similar range query operators and the Distinct, Sub-cube and Top-k Similar inquire query operators;

• Compared with Frag-Cubing, qCube is faster at answering point queries and inquire queries with sub-cube operators;

• It introduces a different cube representation with fewer empty cells than Frag-Cubing;

• Frag-Cubing cannot answer queries with two sub-cube operators in a data cube with 10^7 tuples, C=5000, D=30 and S=2.5.

Page 34: qCube: Efficient integration of range query operators over a high dimension data cube


Conclusions

Interesting research directions to further extend qCube:

First, we must experiment with holistic measures. Update and computation experiments with many holistic measures are a hard problem.

TID lists can become huge, so memory consumption and intersection costs can become impractical. We must therefore find an efficient way to partition TIDs while keeping data retrieval fast.

Multicore and multicomputer versions of qCube must be implemented.

qCube must be improved to answer top-k queries combined with range, point and inquire queries.

Experiments with high dimensional text cubes must be made to evaluate qCube, especially its computation of text measures.

Page 35: qCube: Efficient integration of range query operators over a high dimension data cube


Acknowledgements

