Measuring the sky on computing data cubes via skylining the measures.bak

Measuring the Sky: On Computing DataCubes via Skylining the Measures

Man Lung Yiu, Eric Lo, and Duncan Yung

Abstract—Data cube is a key element in supporting fast OLAP. Traditionally, an aggregate function is used to compute the values in

data cubes. In this paper, we extend the notion of data cubes with a new perspective. Instead of using an aggregate function, we

propose to build data cubes using the skyline operation as the “aggregate function.” Data cubes built in this way are called “group-by

skyline cubes” and can support a variety of analytical tasks. Nevertheless, there are several challenges in implementing group-by

skyline cubes in data warehouses: 1) the skyline operation is computational intensive, 2) the skyline operation is holistic, and 3) a

group-by skyline cube contains both grouping and skyline dimensions, rendering it infeasible to precompute all cuboids in advance.

This paper gives details on how to store, materialize, and query such cubes.

Index Terms—Query processing; data warehouse and repository.

Ç

1 INTRODUCTION

IN the data-warehousing environment, OLAP tools havebeen extensively used for a wide range of decision-

support applications such as sales analysis, customeranalysis, marketing, and services planning. These OLAPtools are built upon a multidimensional data model, in whichdata tuples are partitioned into different cells based on thevalues of their dimension attributes. For each cell of tuples,an aggregate function (e.g., SUM()) is applied to their measureattributes (e.g., sales) and the resulting aggregated valuecontributes to the value of that cell. A data cuboid is thenformed by constituting cells that are created from the sameset of dimension attributes. Each data cuboid represents aunique view of the underlying data. The collection of datacuboids, which are based on different combinations ofdimension attributes, forms a data cube.

In traditional data cubes, an aggregate function takes asinput a set of measure values and returns a single numericvalue. In this paper, we extend the notion of data cube witha new perspective. Specifically, we study the issues ofbuilding data cubes that exploit the skyline operator [4] asthe postoperation instead of the traditional aggregatefunctions. We name this type of data cubes as group-by

skyline cubes. The skyline operator has been well recognizedas a very important decision-support operator in recentliterature and has started to be implemented in commercialquery engines [7]. Given a set S of skyline attributes, a tuplet is said to dominate another tuple t0, denoted by t �S t0, if

ð9 Ai 2 S; t½Ai� < t0½Ai�Þ ^ ð8 Ai 2 S; t½Ai� � t0½Ai�Þ ð1Þ

assuming that smaller values are preferable over larger

ones. Here, we use t½Ai� to represent the value of the

attribute Ai of the tuple t. Given a set D of tuples, theskyline operation � on D is defined as

�ðD; SÞ ¼ ft 2 D j6 9t0 2 D; t0 �S tg: ð2Þ

In other words, a tuple t belongs to the skyline result set ifno other tuple dominates it.

We observe that a group-by skyline cube has twodistinguished features when compared to a traditional datacube. First, a cell of a group-by skyline cube stores a set ofskyline tuples rather than a single numeric aggregate value.Second, since the skyline operation can be applied tomultiple measure attributes, a group-by skyline cubeincludes not only a set of usual grouping dimensions, butalso a set of skyline dimensions, in which the latter is definedon the measure attributes.

We illustrate the concept of group-by skyline cube by theexample in Fig. 1, which is a tableDof employee records. Eachtuple represents an employee, with the employee’s “Region”and “Position,” as well as his/her evaluated scores “Mental-ity,” “Attitude,” and “Flexibility.” Suppose that lower scoresare preferable over higher ones. If we want to find theoutstanding employees of each position, the table in Fig. 2acan be used to answer the query directly. That table is called agroup-by skyline cuboid with grouping dimension set G ¼f‘‘Position’’g and skyline dimension set S ¼ f‘‘Mentality; ’’‘‘Attitude; ’’ ‘‘Flexibility’’g. That essentially means that theemployee in D is first grouped based on their positions andthen the skyline operation is applied to each group to findthose who are not dominated by the others on all three scores.Figs. 2b and 2c show two more examples of group-by skylinecuboids for the employee data set. Cuboid C2 groupsthe employees based on their regions and positions and theskyline operation is applied to their “Mentality” and“Attitude” attributes. Cuboid C3 essentially partitions theemployees in the same way as C2 but the skyline operation isapplied to only the attribute “Attitude.” Hence, for the dataset D in Fig. 1, its corresponding group-by skyline cube is acollection of all cuboids, whereas each such cuboid CiðGi; SiÞhas its grouping dimension set Gi as a nonempty subset of

492 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012

. The authors are with the Department of Computing, Hong KongPolytechnic University, Hung Hom, Kowloon, Hong Kong.E-mail: {csmlyiu, ericlo, cskwyung}@comp.polyu.edu.hk.

Manuscript received 4 Dec. 2009; revised 5 May 2010; accepted 17 Aug. 2010;published online 22 Dec. 2011.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference IEEECS Log Number TKDE-2009-12-0819.Digital Object Identifier no. 10.1109/TKDE.2010.253.

1041-4347/12/$31.00 � 2012 IEEE Published by the IEEE Computer Society

http://ieeexploreprojects.blogspot.com

{“Region,” “Position”} and its skyline dimension set Si as anonempty subset of {“Mentality,” “Attitude,” “Flexibility”}.

We find group-by skyline cube useful for other dataanalysis applications as well. One can extend a super-market’s transactional data warehouse with group-by sky-line cube so that attributes like “Location,” “Time,” and“Product” still serve as the grouping dimensions and the setof measure attributes like “Sales” and “Number of visitors”serve as the skyline dimensions. For hotel analysis, we canbuild a group-by skyline cube with respect to all dimensionssuch as “Star” (e.g., 5-star, 4-star) and “City” and all measureattributes such as “Room price,” “Room quality,” “Trans-portation convenience,” and “Entertainment quality.”

Using the skyline operation as an “aggregate function” indata cubes and the notion of group-by skyline cubes is acompletely new concept. First, while the skyline operatorand its variants have been extensively studied (e.g., [23],[10]), none of them has ever discussed the incorporation ofthe skyline operation into traditional data warehouses as apostoperation. Second, although the concept of skycubes [36],[26] exists, that is fundamentally different from our conceptof group-by skyline cube. Skycubes are designed for theefficient evaluation of skyline queries in any nonemptysubspaces. In other words, that does not consider anygrouping at all. In practice, however, skyline analysis is oftento be more meaningful if the data tuples are first groupedbased on their dimensions before finding their skylines. Let’stake the employee data set in Fig. 1 as an example. Withoutgrouping, finding the skyline of all employee, no matterwhether it is in the full space (i.e., all three measureattributes) or in any subspace (e.g., “Mentality” and“Attitude”), may not be that meaningful because it is unfairto compare an Admin staff with a Technical staff. In contrast,building a group-by skyline cube for that data set allows themanagement to find out the outstanding employee of eachregion and/or position. Consider the NBA player statisticsas another example. Obviously, finding the skyline, despitethat is full-space skyline or subspace skyline, of all NBAplayers is not that meaningful because it is unfair to compareMichael Jordon with Wilt Chamberlain, which were active in

NBA in different periods and played different positions. Incontrast, we can analyze the NBA data using a group-byskyline cube, treating all dimensions such as “Year,”“Team,” and “Position” as the set of all possible groupingdimensions and treating all measure attributes such as“Points,” “Assists,” and “Rebounds” as the set of all possibleskyline dimensions.

Implementing the concept of group-by skyline cube intoday’s data warehouses poses several technical challenges.First, the skyline operation is much more computationalintensive [4], [7] when compared to simple aggregations suchas COUNT(), SUM(), and AVG(). Consequently, building orquerying group-by skyline cubes needs to consider not onlythe I/O cost, but also the computation (CPU) cost. Second,the skyline operation is holistic in nature as the skyline of acuboid is not necessarily derivable from another cuboid. Forexample, consider cuboids C2 and C3 in Fig. 2, although theyshare the same set of grouping dimensions and the skylinedimension set of C3 is a subset of that of C2, their skylineresults are not derivable from each other. Specifically, we cansee that C2 is not derivable from C3 because the skylineemployee b in C2 does not exist in C3. Moreover, C3 is notderivable from C2 because the skyline employee f in C3 doesnot exist in C2 as well. Third, the dimensions that define agroup-by skyline cube include not only the usual groupingdimensions but also the measure attributes. For a group-byskyline cube that is built on a data set with jGj groupingdimensions and jSj measure attributes, there would beð2jGj � 1Þð2jSj � 1Þ possible cuboids. That explosive numberof possible cuboids makes materialization of group-byskyline cubes especially challenging.

To the best of our knowledge, the building of group-byskyline cube, or the implementation of the skyline operationas a postoperation in data warehouses, has not beenaddressed previously in the research literature or incommercial products. This paper studies this issue in detail.Our contributions can be summarized as follows:

1. The concept of group-by skyline cube is presented.That includes the discussion of what a “group-byskyline cuboid” is, the relationships between differ-ent group-by skyline cuboids, and how thesecuboids constitute a group-by skyline cube.

2. The technical details of supporting group-by skylinecube are presented. Specifically, we propose tomaterialize a group-by skyline cube as an extendedgroup-by skyline cube (ES-cube). In an ES-cube, sky-line results across cuboids are derivable from eachother. We further develop construction and queryprocessing algorithms for ES-cube. We also develop

YIU ET AL.: MEASURING THE SKY: ON COMPUTING DATA CUBES VIA SKYLINING THE MEASURES 493

Fig. 1. Employee database D.

Fig. 2. Group-by skyline cuboids. (a) C1. (b) C2. (c) C3.


a budget-based partial materialization algorithmthat is able to select and materialize a good subsetof cuboids in ES-cube that yields the highest queryimprovement in terms of both CPU and I/O costs.

3. An extensive set of experiments has been carried outon both real and synthetic data. Experimental resultsshow that queries can be answered efficiently

The rest of the paper is organized as follows: Section 2gives the background and the related work of this paper.Section 3 elaborates the concept of group-by skyline cube inmore detail. Section 4 presents the issues related toimplementation of group-by skyline cube. Section 5 reportsexperimental results. Section 6 concludes the paper withfuture research directions. The symbols to be used insubsequent discussion are summarized in Table 1.

2 BACKGROUND AND RELATED WORK

2.1 Data Cubes

A data cube [13] can be viewed as a collection of cuboids, inwhich each cuboid stores the group-by aggregate resultwith respect to a particular set of attributes calleddimensions. For a data cube with nonholistic aggregatefunctions, cuboids can be organized as a lattice. Fig. 3 showsa lattice of 23 cuboids of a data set with three dimensionattributes A1, A2, and A3. There exists a path from a cuboidCi to a cuboid Cj if the attributes of Ci contain those of Cj,meaning that Ci can be used to derive Cj. For example, anycuboid in Fig. 3 can be derived from the top cuboid C1.

To meet the performance demands imposed by OLAPoperations, the cube materialization approach [14] whichprecomputes some cuboids in advance, is extensivelyapplied to speed up the evaluation of various OLAPqueries. Full materialization refers to the precomputationof all cuboids (i.e., the full cube) and is impractical becausethe number of cuboids is exponential to the number ofdimensions. On the other hand, partial materialization

refers to the precomputation of a subset of cuboids (i.e., thesubcube) and is generally more frequently used. The cost ofprocessing a query using a cuboid Ci is described by a linearcost model [15], i.e., the query time is assumed to solelydepend on the I/O time of scanning Ci, which is linear tothe size of Ci. During query evaluation, if a cuboid Cirequested by an OLAP query has already been materialized,it can serve as the query result right away. Even if Ci has notbeen materialized, it can still be efficiently derived from thesmallest materialized cuboid Cj in which the dimensions ofCi are a subset of those of Cj.

Given a space budget, the materialization algorithm in[15] selects cuboids to be materialized based on the greedyheuristics, aiming at improving the overall query cost.Initially, it materializes the top cuboid, i.e., the cuboid withall dimensions, and puts it into a materialized set MTbecause other cuboids can always be derived from the topcuboid directly. Afterward, it iteratively materializes acuboid Ci 62 MT that is expected to bring the maximumbenefit to the overall query cost and adds it toMT . Based onthe linear cost model, the benefit of materializing a cuboid Ciis defined as the improvement in I/O, with respect to the setof already materialized cuboids. For example, Fig. 3 showsthe number of tuples of each cuboid in a bracket. The topcuboid C1, with 90 tuples, is the first to be materialized. Next,the benefit of materializing cuboid C4 is calculated as ð90�50Þ � 4 because answering queries related to C4, C6, C7, or C8

can now exploit C4 (which requires scanning only 50 tuples)instead of the top cuboid C1 (which requires scanning90 tuples). Since the benefit of C4 is the highest among all thecuboids that have not been materialized, it is next selected tobe materialized and added to MT . The iteration continuesuntil the selected cuboids exceed the space budget.

Owning to its importance to data-warehousing technol-ogy, research related to data cube has continuouslyattracted the attentions of many researchers. Some focuson the efficient computation of cubes (e.g., [3], [27]), somefocus on the compression and summarization of cubes (e.g.,[29]), and some focus on the variations of cubes (e.g., [35]).

2.2 Skyline Problems

2.2.1 Skyline Computation Algorithms

The skyline operator, introduced in [4], retrieves each tuple tfrom the data setD such that it is not dominated by any othertuple t0 of D. Existing skyline computation methods can bedivided into two categories: 1) index-based solutions [30],[17], [23], and 2) nonindexed solutions [4], [8], [12], [2], [37].

Early index-based solutions [30] operate on bitmap andBþ-tree indices on the skyline attributes, whereas the others[17], [23] operate on an R-tree that indexes the data set byskyline attributes. The Branch and Bound Skyline (BBS)


TABLE 1Summary of Symbols

Fig. 3. A lattice of cuboids with attributes A1, A2, and A3. The number oftuples in each cuboid is in the bracket.


algorithm [23] is the state-of-the-art indexed approach forskyline computation, and its I/O cost is proven to beoptimal with respect to the R-tree instance.

The state-of-the-art skyline computation algorithm fornonindexed data is OSP [37]. It will be adopted as theskyline computation method in our proposed solution (seeSection 4.3). Thus, we describe this method in detailbelow. Like the SFS algorithm [8], the OSP algorithm firstsorts the data set by a monotone score function F (onskyline attributes). F is simply the SUM function in [37].The main difference is that OSP employs a (main-memory) OSP tree for indexing the skyline points foundso far, rendering it efficient to process the remaining datapoints. Fig. 4 illustrates how OSP works on a set of points(in the space defined by two skyline attributes S1 and S2).Each contour (in dotted line) depicts all possible locationswith the same SUM score. The sorted points follow theorder: t2; t4; t7; t5; t1; t6; t3.

When a point t is retrieved (by the sorted order), wecheck whether the OSP tree contains any point t0 thatdominates t. This is implemented by a depth-first traversalof the OSP tree branches that potentially contain some pointdominating t. If there is no such point t0, then we report t asa skyline point and insert it into the OSP tree. Each node ofthe OSP tree stores exactly one skyline point, and itsmaximum number of children is 2jSj � 2, where S is the setof skyline attributes. In the example, when the first point t2is retrieved, it is inserted as the tree root, which decomposesthe domain space into 22 ¼ 4 regions. Since t2 is a skylinepoint, its bottom-left region is guaranteed to contain nopoints. Also, its top-right region is dominated by t2 so theregion cannot contain any skyline points. Those two regionsdo not need to be maintained by the OSP tree. The nextretrieved point t4 is a skyline as it is not dominated by anypoint in the OSP tree. Since t4 falls into the top-left region oft2, it is inserted into the left subtree of t2. Then, the nextpoint t7 is again a skyline point; it falls into the bottom-rightregion of t2 so it becomes the right child of t2. The next pointt5 is also a skyline point, so it is inserted into the left subtreeof t7. The algorithm proceeds in this fashion until all theremaining points have been examined.

2.2.2 Variants of Skyline Queries

Since the introduction of the skyline query in [4], numerousskyline variants have been proposed, for example, skyband[23], spatial skyline [28], reverse skyline [10], subspaceskyline [32], [31], [16], dominance-based analysis [20],relaxation of the skyline conditions [6], [5], k mostrepresentative skyline points [21], as well as skyline

processing using parallel execution [9] and skylining fromdistributed data sources [1].

In the OLTP environment, Papadias et al. [23] and Luket al. [22] have studied the evaluation of skyline queries withgroup-by attributes. Papadias et al. [23] proposed an efficientR-tree-based algorithm for processing skyline queries withgroup-by attributes. Unfortunately, their solution relies onthe availability of an R-tree (with both group-by attributesand skyline attributes) so it is not suitable for handling ad-hocqueries. Luk et al. [22] focused on the relational engine andinvestigated how to find an efficient query plan for skylinequeries with group-by attributes. Their solution, however,does not study the materialization of skyline cuboids in theOLAP environment.

2.2.3 Subspace Skyline Analysis

Building data cubes using the skyline operation as the“aggregate function,” to the best of our knowledge, havenot been addressed before. Work in [24], [36], [25], [26],[31], [34] is developed for subspace skyline analysis. In theexample of Fig. 2, the cuboids in our group-by skylinecube contain both group-by attributes and skyline attri-butes. In contrast, the skycube [24], [36], [25] of a data set Dis defined as the collection of skyline result set �ðD; SÞ foreach nonempty subset S of S� but without any groupingattributes. A compressed skycube (CSC) [34], [26] is acompressed structure for storing a skycube; it still doesnot involve any grouping attributes. Those works do notconsider any grouping operation and partial materializa-tion, and thus they are different from our work here.Another related work is P-cube [35]. P-cube is designed tofacilitate the processing of selection skyline queries, i.e.,finding the skyline from tuples that satisfy the WHERE

clause in SQL. P-cube is orthogonal to us because it doesnot consider partial materialization and it is designed forselection queries that access a small portion of data. Incontrast, group-by skyline cube is designed for OLAP-style queries that access a significant portion of data forreport generation.

2.2.4 Skyline Cardinality and Cost Estimation

In this paper, we adopt a budget-based partial materializa-tion approach for selecting group-by skyline cuboids to bematerialized such that they yield the highest improvementin query cost. Recall that each cell of a cuboid correspondsto a set of skyline tuples (with respect to particular group-by attributes). Thus, it is important to study existing workon estimating skyline cardinality and its computation cost[11], [7], [12], [38].

Godfrey et al. [11], [12] propose a mathematical model ofthe skyline cardinality for a uniformly distributed data set,and the asymptotic cost of several skyline algorithms in theaverage case and the worst case. Chaudhuri et al. [7] applythe log sampling technique for estimating the skylinecardinality and the skyline computation cost of the BNLand SFS algorithms, with respect to arbitrary data distribu-tion. Zhang et al. [38] propose a kernel-based model for amore accurate estimation of the skyline cardinality ofarbitrary data sets; however, it does not study thecomputation cost of any skyline algorithm.


Fig. 4. Example of an OSP tree.


3 CONCEPT OF GROUP-BY SKYLINE CUBE

In this section, we first introduce the concept of group-byskyline cube whose measure is defined by a skylineoperation, and then show that some nice properties oftraditional data cubes are no longer valid in this context. Wewill discuss how those invalidated properties can be re-enabled again by using a skyline variant definition inSection 4.

3.1 Definition of Group-By Skyline Cube

Group-by skyline cube is intended for the efficient evalua-tion of various group-by skyline queries. Let D be a relationaltable instance, with the schema A ¼ ðA1; A2; . . . ; AkÞ. Forsimplicity, suppose that every attribute Ai is numeric.Given a set G � A of grouping attributes and a groupinstance g of G, we define the set DðgÞ as the set of tuples ofD belonging to the group instance g. Accordingly, a group-by skyline query is defined as follows:

Definition 1 (Group-by skyline query). Given the sets G andS of attributes, a group-by skyline query Q ¼ ðG; SÞcomputes a skyline result set �ðDðgÞ; SÞ for each groupinstance g defined on G.1

Considering the data set D in Fig. 1, the result of a group-by skyline query Q with its grouping dimension setG ¼ {“Region,” “Position”} and its skyline dimension setS ¼ {“Mentality,” “Attitude”} is shown in Fig. 2b. Anexample of a group instance g is (“Asia,” “Admin”).

In order to evaluate various group-by skyline queriesefficiently, we introduce the concepts of group-by skylinecuboid and group-by skyline cube.

Definition 2 (Group-by skyline cube). Let G� be the set of allpossible grouping dimensions and S� be the set of all possibleskyline dimensions. Given a subset G G� and a subsetS S�, a group-by skyline cuboid CðG;SÞ is defined as acollection of cells, each cell g (corresponding to a group of G)stores the set of skyline result set �ðDðgÞ; SÞ. The group-byskyline cube is defined as the collection of CðG;SÞ, for anysubset G G� and nonempty S S�.Consider the relational table D in the left part of Fig. 5. In

this example, G� ¼ fG1; G2; G3g and S� ¼ fS1; S2g. Thegroup-by skyline cube for D thus contains

ð2jG�j � 1Þð2jS�j � 1Þ;

i.e., 7� 3 ¼ 21 cuboids, in total. The right part of Fig. 5depicts the contents of six cuboids. For instance, cuboid Acontains three cells, indicating the distinct combinationsðc1; d1Þ, ðc1; d2Þ, ðc2; d2Þ on the grouping attributes ðG2; G3Þ,for tuples in the table D. For the cell ðc1; d1Þ of cuboid A, theskyline of this group are tuples t1 and t2.

3.2 Conditions for Sharing Computations of Cubes

Most data cube algorithms, regardless of whether they areused to compute data cubes or to answer queries, also relyheavily on the sharing of computations to achieve highefficiency. For example, in a sales application, if a cuboid(year, department: COUNT(sales)) that stores the sales of eachdepartment in every year has been computed, it can be usedto compute the cuboid (year : COUNT(sales)) that stores thesales of all departments in every year, without accessing thebase data. However, it is unclear whether such sharing isstill valid in the context of group-by skyline cube.

In the following, we show that group-by skyline cubes canshare some but not all computational effort. Specifically, forgroup-by skyline cuboids that share the same set of skylinedimensions S, the computational effort can be shared:

Lemma 1 (Union of skyline). Given any subset D0 D,�ðD; SÞ �ðD0; SÞ [�ðD � D0; SÞ.

Through Lemma 1 [19], we arrive at the followinglemma that shows that one group-by skyline cuboid can bederived from another if they share the same set of skylinedimensions:

Lemma 2 (Group-by skyline cuboid hierarchy). Given twogroup-by skyline cuboids CðG; SÞ and CðG0; SÞ such thatG0 G, the group-by skyline cuboid CðG0; SÞ can be derivedfrom CðG;SÞ.

Proof. By the definition of group-by skyline cuboid, eachcell g0 in CðG0; SÞ corresponds to a subset V of cells inCðGÞ, i.e., Dðg0Þ ¼

Sg2V DðgÞ and different DðgÞ are

disjoint. By applying Lemma 1, the result of g0 can bederived from the results of the set V of cells in CðG;SÞ.Thus, this lemma is proved. tu

Fig. 6 shows the lattice of group-by skyline cubiods (i.e.,group-by skyline cube) for the data set in Fig. 5. Weorganize the group-by skyline cuboids in a way such thateach edge represents a parent-child relationship betweentwo group-by skyline cuboids, meaning that the result of achild cuboid can be derived from its parent cuboid throughLemma 2. For instance, there is an edge from cuboid A to


Fig. 5. An example data set (on the left) and some of its group-by skyline cuboids (on the right).

1. Note that when S contains only one attribute, the skyline operationbecomes a simple aggregate function MAX or MIN. Nonetheless, ourdefinition includes such case for generality.


cuboid D, showing that cuboid D can be derived from

cuboid A. To illustrate, cell c1 of cuboid D in Fig. 5 can be

obtained by finding the skylines among the cells ðc1; d1Þ and

ðc1; d2Þ of cuboid A.Unfortunately, we remark that computational efforts

between cuboids that share different sets of skyline dimen-

sions cannot be shared unless the distinct value condition (no

two tuples share the same value on any subset of skyline

attributes) [36] holds. For instance, in Fig. 6, although cuboid

C shares the same set of grouping dimensions as cuboid A

and the skyline dimension of cuboidC is a subset of cuboidA,

C cannot be derived from A, unless the tuples have distinct

values on the skyline attributes. To illustrate, in Fig. 5, the cell

ðc2; d2Þ of cuboid C includes a tuple t6 that does not exist in

cuboid A. As we have mentioned, the efficiency of the state-

of-the-art data cube construction/query algorithms relies

heavily on the fact that many computational efforts can be

shared. However, in a group-by skyline cube, the sharing of

computation is limited by the distinct value condition, which

almost never holds in real data sets. Therefore, the group-by

skyline cuboids in Fig. 6 are separated into three disjoint

clusters in which computational efforts are not sharable

across the clusters. Fortunately, there are ways to remove

such barriers and we will discuss them in the next section.

4 IMPLEMENTATION

In this section, we address the issues related to the

implementation of group-by skyline cube. Specifically, we

attempt to answer the following research questions:

. ðIRQQ1Þ Given the poor connections between group-byskyline cuboids, how can we enable the sharing ofcomputational efforts between cuboids? (Section 4.1)

. ðIRQQ2Þ Given a materialized cuboid, how can a query Qbe answered efficiently? (Section 4.2)

. ðIRQQ3Þ Given a limited storage space budget, how shallwe select a subset of cuboids that offers the bestimprovement in query performance? (Section 4.3)

. ðIRQQ4Þ After selecting the cuboids, how can we constructthem in an efficient manner? (Section 4.4)

4.1 Materialization: Extended Group-By SkylineCube

In order to maximize the sharing of computation of group-

by skyline cuboids, especially the sharing of computation

across the set of cuboids separated by the distinct value

condition, we materialize a group-by skyline cuboid using

an extended definition of skyline. We show that group-by

skyline cubes materialized in this way can enable sharing

across various cuboids.

Definition 3 (Extended group-by skyline). The extendedgroup-by skyline of the tuple set DðgÞ with respect to theskyline attribute set S is defined as

�þðDðgÞ; SÞ ¼ ft 2 DðgÞ j6 9t0 2 DðgÞ; t0 �þS tg;

where a tuple t is said to strictly dominate another tuple t0

with respect to the attribute set S, denoted by t �þS t0, if

ð8 Ai 2 S; t½Ai� < t0½Ai�Þ:

The notion of extended skyline [33] was originallyproposed for skyline evaluation in P2P network. We borrowthat concept and propose to materialize a group-by skylinecube as an extended group-by skyline cube.

Definition 4 (Extended group-by skyline cube, ES-cube).

Given a set G of grouping attributes and a set S of skylineattributes, we define an extended group-by skyline cuboidCþðG;SÞ as a collection of cells, where each cell g

(corresponding to a group of G) stores the set of extendedskyline �þðDðgÞ; SÞ. The extended group-by skyline cubeis defined as the collection of CþðG; SÞ, for any subset G G�and S S�.

Lemma 3 (Extended group-by skyline cuboid hierarchy).

Given two extended group-by skyline cuboids CþðG; SÞ andCþðG0; S0Þ such that G0 G and S0 S, the cuboidCþðG0; S0Þ can be derived from CþðG; SÞ.

Proof. Each cell g0 in the cuboid CþðG0; S0Þ corresponds to asubset V of cells in the cuboid CþðG;SÞ, i.e., Dðg0Þ ¼Sg2V DðgÞ and different DðgÞ are disjoint. Thus, the result

of g0 can be derived from the results of the set V of cellsin CþðG; SÞ. tu

Lemma 3 is powerful. It allows us to maximize thepossibility of sharing computations among the ES-cuboids.Fig. 7 shows the lattice of ES-cuboids (i.e., ES-cube) for thedata set in Fig. 5. Note that, through Lemma 3, the threedisjoint clusters of cuboids in Fig. 6 are now connected. Forinstance, cuboids B and C cannot be derived from anycuboid in the cluster of cuboids S1S2 (the left cluster in Fig. 6)but now they can be derived from extended cuboid Aþ. Toillustrate, Fig. 8 depicts the extended group-by skylineversion of the six cuboids we illustrated in Fig. 5. Note that,now cuboid Aþ contains tuple t6 because t5 cannot strictlydominate t6 on every skyline attribute. Consequently, cell


Fig. 6. Group-by skyline cube.

Fig. 7. Extended group-by skyline cube.


ðc2; d2Þ of group-by skyline cuboidC can now be obtained by

computing the skyline of cell ðc2; d2Þ of cuboid Aþ on

dimension S2. Similarly, group-by skyline cuboids E and F

cannot be derived from any group-by skyline cuboid in the

cluster of cuboids S1S2 but now they can be derived from

extended group-by skyline cuboid Dþ. For brevity, we do

not show all parent-child relationships in Fig. 7, but every

ES-cuboid on the right-hand side of Fig. 7 (and every group-

by skyline cuboid) indeed can be derived from a cuboid on

the left-hand side. Furthermore, every ES-cuboid (and every

group-by skyline cuboid) can be derived from the top cuboid

CþðG1G2G3; S1S2Þ.

4.2 Storage and Query Processing

Now, we discuss how to answer a group-by skyline queryQ ¼ ðGQ; SQÞ using a materialized ES-cuboid CþðG;SÞ,where GQ G and SQ S. We first concentrate on thecase that GQ ¼ G and SQ S. A simple method in this caseis to project the points in each cell g 2 Cþ to the queryskyline subspace SQ, and then find the skyline. Fig. 9 showsa set of extended skyline points in a cell. If SQ ¼ fS1g, thismethod projects all four points to the S1 and then finds theskyline among those points. Note that if the skylinedimensions have low cardinalities, the performance gainbrought by ES-cubes may significantly decrease because thenumber of extended skylines increases. However, we aregoing to show that, by carefully storing the extendedskyline points, ES-cubes can still provide certain improve-ment when the skyline dimensions have low cardinalities.First, from the example above, we can see that t3 is at mostas good as t1 in subspace S2, but t3 gets dominated by t1 inthe full space and subspace S1. In other words, if t1 is not askyline in the query subspace, then it is not necessary toexamine t3 at all. Therefore, we decompose a set ofextended skyline points �þ into two disjoint subsets:1) the skyline set � and 2) the child set b�, whereb� ¼ �þ ��. Given a point t in the child set b�, we definethe parent skyline set of t as follows:

PSðtÞ ¼ ft0 2 � j ðt0 �S tÞ ^ ðt0 6�þS tÞg:

Fig. 9b illustrates the decomposition of �þ in terms of �

and b�, with respect to the example in Fig. 9a. The parent

skyline set of each tuple in b� is also shown in the figure. For

instance, the parent set of t3 contains t1 because t1 dominates

t3 but t1 does not strictly dominates t3. Note that we only

store the parent skyline set as a set of IDs. To avoid random

page accesses (during query processing), each tuple t of b� is

stored with its parent skyline set sequentially in the disk. By

decomposing and storing the extended skyline this way, we

apply Lemma 4 to exploit parent skyline for pruning

unqualified tuples in the child set effectively, thereby

reducing the query CPU cost (i.e., number of dominance

comparisons).

Lemma 4 (Parent-based pruning). Given a sub-data set D0 D and a skyline attribute set S. Let SQ be any skyline attributeset that satisfies SQ S. Given a tuple t 2 b�, if9 t0 2 PSðtÞ; t0 62 �ðD0; SQÞ, then t 62 �ðD0; SQÞ.

Proof. It is given that t0 2 PSðtÞ, so we have t0 �S t. Byimposing the condition SQ S on the above, we obtain8 Ai 2 SQ; t0½Ai� � t½Ai�.

It is also given that t0 62 �ðD0; SQÞ. Thus, there exists atuple t00 2 �ðD0; SQÞ such that t00 �SQ t0. By combiningboth t00 �SQ t0 and 8 Ai 2 SQ; t0½Ai� � t½Ai�, we derivet00 �SQ t. This implies that t does not belong to the resultset �ðD0; SQÞ. tu

Therefore, the idea of the query algorithm is as follows:for a cell g, we first compute the projected skyline �þSQ on theskyline set �þ along the query subspace SQ. These tuples arepart of the final skyline for that cell. Then, we examine thechild set b� and a tuple t 2 b� is pruned when PðtÞ 62 �þSQ .Finally, we compare the remaining tuples with the projectedskyline �þSQ on the query subspace SQ and those still remainas skyline are also added as the final skyline for that cell.

4.2.1 Query Processing Algorithm

In the above discussion, we require the cuboid CþðG; SÞ andthe query QðGQ; SQÞ to have the same group-by attributes,i.e., GQ ¼ G.

We now present the query processing method in Algo-rithm 1, which handles the general case where GQ G. Itassumes that a particular cuboid Cþ has been chosen forprocessing the query. The issues on choosing a cuboid andquantifying its “goodness” will be studied in Section 4.3.


Fig. 8. Example of extended group-by skyline cuboids.

Fig. 9. Extended skyline of a group. (a) A set of extended skyline tuples.(b) Storing extended skyline.


For the general case GQ G, a group instance gQ of GQ

corresponds to a subset Vc of cells of Cþ (see Line 3). AtLines 4-7, we extract the skyline set �ðgcÞ of each cell gc 2 Vcas the set Tfirst, and then compute its skyline (along SQ) asthe candidate set Rfirst. At Lines 9-12, we examine the childset b�ðgcÞ of each cell gc 2 Vc. A tuple t from a child set isinserted into the candidate set Tsecond if all parents of t existin the set Rfirst. Finally, we merge the sets Rfirst and Tsecondtogether, and compute the skyline of the merged tuple set.The specific skyline computation algorithm used at Line 13will be clarified in Section 4.3.

4.3 Partial Materialization

In order to ensure fast online group-by skyline queryprocessing, it is often desirable to precompute/materializethe extended skyline cuboids in the ES-cube. Since it is nota practical option to materialize all ES-cuboids, we focuson partial materialization [15], which is one of the mostpopular adopted methods in commercial products. Given aspecific space budget, we greedily choose to materialize asubset of ES-cuboids that can bring the maximum queryprocessing improvement.

In traditional data cube, the selection of a cuboid formaterialization is based on a linear cost model, i.e., that thecost of evaluating a query using a cuboid C is linear to thesize of C, i.e., the I/O cost of scanning C once (or at mosttwice). Recent work in skyline [7], [38] has mentioned thatthe CPU cost contributes a significant portion of the overallskyline query processing cost. However, does that argu-ment also hold in group-by skyline processing? If yes, weshall really consider the CPU cost in selecting which ES-cuboid to be materialized. To answer the question, we havecarried out an investigation experiment. Here, we use thedefault parameter setting as in the experimental section andwe measure the wall clock time of answering 100 randomvalid group-by skyline queries from a set of materializedES-cuboids. In this paper and in the experiment, we adoptOSP [37], the best skyline algorithm to date, for computingskyline. Our findings are: “only 26.9 percent of the executiontime is spent on the I/O cost (of reading an ES-cuboid from disk),whereas skyline computation shares 55.7 percent of the time. Theremaining 17.4 percent of time contributes to the groupingoperation and the other overhead.”

Therefore, it is important to capture the CPU cost inselecting a cuboid to be materialized. Now, we define thebenefitB of including an ES-cuboid CþðG;SÞ intoMT (whichrepresents the set of ES-cuboids that should be materialized)

as follows: let PAQðCþÞ and CPQðCþÞ be the I/O cost (pageaccesses) and CPU cost (number of comparisons) of evaluat-ing any group-by skyline query QðGQ; SQÞ using an ES-cuboid CþðG;SÞ (where GQ G; SQ S). The benefit B ofincluding ES-cuboid CþðG;SÞ into MT is based on howmuch gain in terms of I/O accesses and tuple comparisons ifCþðG;SÞ is materialized:XGQG;SQS

maxf0;minCþi ðG0;S0Þ2MT ;GQG0;SQS0

T IOðPAQðCþi Þ � PAQðCþÞÞ þ T CP ðCPQðCþi Þ � CPQðCþÞÞg;ð3Þ

where T IO denotes the time of accessing a disk page, T CPdenotes the time of performing a dominance comparison,and Cþi ðG0; S0Þ is any ES-cuboid inMT that could be used toanswer Q. T IO and T CP are machine-specific systemparameters. They are usually stored in the system catalogor can be derived, for example, by measuring the number ofpage accesses and the number of dominance comparisonsthat can be done in 1 second.

Regarding the cuboid selection problem, we apply thegreedy approach (as described in Section 2.1) for pickingcuboids to be included into the setMT . The only differenceis that, (3) is used to compute the benefit of each cuboid.Next, we elaborate the I/O and CPU component costs of (3).

4.3.1 Derivation of PAQðCþÞWe proceed to discuss the first component of benefit, i.e.,the I/O cost PAQðCþÞ of evaluating a query QðGQ; SQÞ byusing an ES-cuboid CþðG; SÞ (where GQ G; SQ S).

We assume that the main memory is large enough to holdthe ES-cuboid Cþ. It suffices to read Cþ from the disk once.Both the grouping and skyline operations are performedsolely in main memory, so they do not incur additional I/Ocost. Thus, PAQðCþÞ can be derived as the I/O cost ofscanning Cþ once. Hence, the I/O cost PAQðCþÞ depends onthe size (i.e., the number of tuples and the number ofattributes) of Cþ but it is independent of Q.

If the domain of the attributes are real values (i.e., veryfew tuples share duplicate values), the size of an ES-cuboidis close to the size of a group-by skyline cuboid. In thatcase, the size of an ES-cuboid can be estimated bysumming the estimated skyline size of each cell of tuplesin the cuboid through the skyline cardinality estimationfunctions developed in [7], [38]. However, when thedomain of an attribute is finite (e.g., an integer domain of(say) [0,99k] for number of rebounds in the NBA data set),the size of an ES-cuboid would be larger than the size of agroup-by skyline cuboid. Consequently, in order tocompute PAQðCþÞ, we first devise techniques to estimatethe size of an ES-cuboid, or essentially, the size of anextended skyline, on a finite domain. After that, we willcome back to the derivation of PAQðCþÞ.

Estimating extended skyline size. Fig. 9a illustrates a setof tuples with two skyline dimensions S1 and S2. Theskyline set contains t1; t2, which collectively form a (dotted)“contour” in the figure and the extended skyline setcontains all tuples on the contour: t1; t2; t3; t4. Suppose thateach skyline attribute has an integer domain ½0; ��, i.e., thenumber of possible values is �þ 1. A tuple t belongs to the



extended skyline if there exists no tuple t0 2 D that strictlydominates it.

We now generalize the above observation to a data setwith a set S of skyline dimensions and N tuples. Let �ðSÞ bethe space of all possible points t, satisfying t½Ai� 2 ½0; ��;8Ai 2 S. Clearly, there are ð�þ 1ÞjSj possible distinct pointsin the space domain. Given a point t 2 �ðSÞ, we define itsprobability density as

pðt; SÞ ¼ jft0 2 D j t0½Ai� ¼ t½Ai�; 8Ai 2 Sgj

Nð4Þ

and its cumulative strict dominated density as

P ðt; SÞ ¼X

t02�ðSÞ; 8Ai2S;t0 ½Ai�<t½Ai�pðt0; SÞ:

Each possible point has a probability of pðt; SÞ being anactual tuple in the data set D, which contains N tuples. Theprobability of t being an extended skyline tuple isð1� P ðt; SÞÞN . Thus, we obtain the size of an extendedskyline of N data tuples as

�ESðNÞ ¼Xt2�ðSÞ

pðt; SÞ 1� P ðt; SÞð ÞN: ð5Þ

The above equations (containing ð�þ 1ÞjSj terms) can beefficiently computed by using a numerical technique calledmidpoint approximation [18].

Derivation of PAQðCþÞ, revisit. Now, we can derivePAQðCþÞ by using (5). Recall that PAQðCþÞ can be simplyderived as the I/O cost of scanning Cþ once, let ! be thenumber of distinct groups in Cþ and each group gi hasjDðgiÞj tuples, we have

PAQðCþÞ ¼1þ jGj þ jSj

BitemX!i¼1

�ESðjDðgiÞjÞ; ð6Þ

whereas the factor ð1þ jGj þ jSjÞ accounts for the size ofa tuple in C (a tuple ID takes 1 unit of storage), and Bitem

represents the number of items/attributes that fit in adisk page.

Given a data set D and a group-by skyline queryQ ¼ ðGQ; SQÞ, the number of distinct groups ! and thenumber of tuples of each group jDðgiÞj can be easily derivedfrom the data distribution using some basic textbookequations. Please refer to [22] for details.

4.3.2 Derivation of CPQðCþÞWe now discuss the second component of benefit (3), i.e.,the CPU cost CPQðCþÞ (i.e., number of dominance compar-isons) of evaluating the query QðGQ; SQÞ by using an ES-cuboid CþðG; SÞ (where GQ G; SQ S) for arbitrary datadistribution. The model is derived based on OSP [37], thealgorithm that we adopted in query processing, which is themost efficient index-free skyline algorithm to date.

Suppose that we have a multidimensional histogram HC

that models the distribution of tuples of Cþ in the space of Swith values in histogram buckets are normalized to have asum of 1. The CPU cost of OSP can be estimated by1) construct a histogram HQ (for the data distribution of Cþin the subspace SQ); 2) apply HQ to estimate the skylinelocations in the space SQ; and 3) build an OSP tree with the

estimated skyline and estimate the query CPU cost. Step 1

can be efficiently done by projecting the dimensions of SQfrom the original histogram HC . Step 2 can be implemented

by the approximate skyline location estimation technique in

[23]. Thus, our main focus is on Step 3.Histogram-based CPU cost estimation of OSP.It remains to discuss how we utilize the histogram HC for

estimating the CPU cost of OSP. We suppose that the

readers are already familiar with how OSP works, as

described in Section 2.2. The subsequent example follows

the setting of the OSP example described in Fig. 4.We shall use the terms “point” and “OSP tree node”

interchangeably in the discussion below. Assume, after

estimating the skyline locations (i.e., after Steps 1 and 2 we

mentioned above), the points of a cell on a subspace SQ ¼fS1; S2g are shown in the left-hand side of Fig. 10 and those

points form an OSP tree on the right-hand side of the figure.Let t be a tree node (with the point t). We use tp to

represent the parent of t; it is null if t is the root. The

bounding region t:� of t is a hyperrectangle formed by its

lower corner t:�� and upper corner t:�þ, which are in turn

defined as

t:��½Si�; t:�þ½Si� ¼0; 1 ; if ðtp ¼ nullÞ;tp:��½Si�; tp½Si�; if ðt½Si� < tp½Si�Þ;tp½Si�; tp:�þ½Si�; if ðt½Si� � tp½Si�Þ:

8<:

For instance, in Fig. 10, node t1 has no parent, so we have

t1:� ¼ ½0; 1Þ � ½0; 1Þ. Node t2 is the “top-left” child of t1 so

we have t2:� ¼ ½0; 0:3Þ � ½0:4; 1Þ.Let v be the next (arbitrary) point in the space to be

visited by OSP, t be a node in the OSP tree, and �ðtÞ be the

set of ancestor tree nodes of t. In OSP, we need to perform a

dominance comparison between v and t iff t gets visited.

This happens when 1) the lower corner of t dominates v, i.e.,

t:�� v, and 2) none of the ancestor nodes of t dominates v,

i.e.,Vt02�ðtÞ t

0 6� v. To capture all the possible points that

cause a node t to be visited, we define a region Rt for t so

that points in Rt satisfy conditions 1 and 2. Consequently,

Rt can be written as Rt ¼ Rtdom �Rt

anc, where Rtdom is the

region dominated by t:�� (condition 1), and Rtanc is the

(union of) region dominated by t0:�� for any ancestor t0 2�ðtÞ (condition 2). For example, for node t2, its lower corner

t2:�� dominates the region Rt2

dom ¼ ½0; 1Þ � ½0:4; 1Þ and its

parent t1 dominates the region Rt2anc ¼ ½0:3; 1Þ � ½0:4; 1Þ.

Thus, any points that fall into the region Rt2dom �Rt2

anc ¼½0; 0:3Þ � ½0:4; 1Þ cause node t2 to be visited.


Fig. 10. Estimation of the comparisons on the OSP tree.


Given the histogram HQ, we use HQðRÞ and HQðv1; v2Þ todenote the probability density of a region R and the regionformed by the corners v1 and v2, respectively. Therefore, thenumber of dominance comparisons of OSP is estimated as

CPQðCþÞ ¼ ! Xt2�

N

!HQðRtÞ

¼ N Xt2�

HQðt:��;UUþQÞ �Xt02�ðtÞ

HQðt0; t0:�þÞ

0@

1A;

where � denotes the set of OSP tree nodes and UUþQ denotesthe upper corner of the (skyline) space domain in HQ.

Note that the above equation is a pessimistic estimate ofthe CPU cost because it does not capture the pruning effectof Lemma 4. Nevertheless, this cost model is adequate toprovide good estimates—our experiments show that thiscost model yields query improvement close to that offeredby an optimal cost model.

4.4 Construction

In this section, we develop an efficient method forconstructing the cuboids of MT , which denotes the set ofchosen ES-cuboids according to Section 4.3. For the runningexample here, suppose that we use the data set shown inFig. 5 and the set MT contains the following ES-cuboids(see their tuples in Fig. 8):

. Aþ : CþðG2G3; S1S2Þ

. Cþ : CþðG2G3; S2Þ

. Dþ : CþðG2; S1S2Þ

. Fþ : CþðG2; S2ÞA simple method for constructing a cuboid CþðG;SÞ 2

MT is to take the data set D as input, apply the group-byoperator on D with the group-by attribute set G, and thenapply the skyline operator (e.g., OSP) as the “aggregationfunction” on each group with the skyline attribute set S.However, this method is slow as it does not exploit thesharing of computations among cuboids.

We then discuss how to construct these ES-cuboids in anefficient manner. Let AMT be the set of constructedcuboids, which is empty in the beginning. According toLemma 3, a cuboid CþðG0; S0Þ can be derived from anancestor cuboid CþðG; SÞ, i.e., G0 G and S0 S (providedthat such a cuboid CþðG; SÞ has been constructed, i.e., itexists in AMT ).

In order to utilize this sharing effect, we propose to sortthe (chosen) ES-cuboids of MT in the topological order oftheir attributes, such that a cuboid appears after all its(possible) ancestors. For instance, the ES-cuboids are sortedin the order: CþðG2G3; S1S2Þ, CþðG2G3; S2Þ, CþðG2; S1S2Þ,and CþðG2; S2Þ.

Continuing with the example, we first build CþðG2G3;S1S2Þ and insert it into AMT . For the next two cuboidsCþðG2G3; S2Þ and CþðG2; S1S2Þ, we build them from theirancestors CþðG2G3; S1S2Þ (according to Lemma 3). Thesetwo cuboids are then inserted into AMT . Finally, weattempt to construct the cuboid CþðG2; S2Þ and find that ithas multiple ancestors: CþðG2G3; S1S2Þ, CþðG2G3; S2Þ,CþðG2; S1S2Þ. These three cuboids contain five, four, andfour tuples, respectively, as shown in Fig. 8.

Regarding the cuboid used for constructing CþðG2; S2Þ, asensible choice would be to use its smallest ancestor cuboidCþðG2G3; S2Þ (having four tuples), which incurs lowconstruction cost.

Algorithm 2 is the pseudocode for constructing ES-cuboids efficiently. It takes the setMT of chosen cuboids asinput. Let AMT be the set of constructed cuboids,initialized to the empty set. First, we sort the cuboids ofMT by the topological order. For each cuboid CþðG0; S0Þ inthe sorted MT , we find its set � of ancestor cuboids thathave been constructed (i.e., found in AMT ). Then, we pickthe smallest ancestor cuboid C� from �, for constructingCþðG0; S0Þ. After that, we insert CþðG0; S0Þ into AMT forsubsequent use.

Algorithm 2. Cuboid Construction Algorithm

5 EXPERIMENTS

In this section, we experimentally evaluate the efficiency andeffectiveness of the proposed technique (we name it as ES).The objective is to evaluate the query I/O cost (i.e., disk pageaccesses), CPU cost (i.e., dominance comparisons), and cubeconstruction time (if applicable) of using the proposedtechnique with respect to a workload of 100 randomlygenerated group-by skyline queries, each of which has arandom number of 1-5 group-by attributes and 1-5 skylineattributes. All the experiments were conducted on an IntelCore 2 Quad 2.83 GHz PC with 4 GB of memory. All thealgorithms were implemented using C++. The page size is setto 4 K Bytes. We use the term “budget-to-data ratio” torepresent the ratio of the data cube budget space (in diskpages) to the data set storage size (in disk pages). The defaultbudget-to-data ratio (for data cube construction) is set to 10:1.

5.1 Description of Data Sets and Methods

We have carried out experiments on synthetic data setsand three real data sets (NBA, Baseball (BB), and ForestCover (FC)).

. NBA. It stores the NBA players’ statistics2 from 1946to now. It contains 20,788 tuples, where each tuplestores the statistics of a player in a season. Weidentify attributes “Year” and “Team” as groupingdimensions and attributes “Games played,”“Points,” “Rebounds,” “Assists,” “Steals,” “Blocks,”“Free throws,” and “Three-point shots” as themeasures (i.e., skyline dimensions).

. Baseball. It stores the baseball batters’ statistics3 from1871 to now. It contains 88,686 tuples, where eachtuple stores the statistics of a batter in a season. We


2. http://basketballreference.com/stats_download.htm.3. http://baseball1.com/statistics/.


identify attributes “Year,” “Team,” and “League” asgrouping dimensions and attributes “Games,” “Atbats,” “Runs,” “Hits,” “Doubles,” and “Runs battedin” as the measures (i.e., skyline dimensions).

. Forest Cover. It is obtained from UCI MachineLearning Repository.4 It stores 581,012 tuples, eachrepresenting a 30� 30 forest land cell. We identifyattributes “Soil type A,” “Soil type B,” “Slope,” and“Cover type” as grouping dimensions and attributes“Roadway distance,” “Vertical hydrology,” “Hori-zontal hydrology,” “Hill-shade at am,” “Hill-shade atnoon,” and “Hill-shade at pm” as skyline dimensions.

. Synthetic data sets. Our synthetic data sets aregenerated similar to the benchmark setting de-scribed in [7]. Unless stated otherwise, each dataset contains N ¼ 1;000;000 tuples, with a total of14 attributes (a1, a2; . . . ; a14): five attributes a1 . . . a5

are for grouping, and nine attributes a6 . . . a14 are forskyline. Specifically, a6 . . . a8 are generated indepen-dently, a9 . . . a11 are correlated, and a12 . . . a14 areanticorrelated. By default, the domain size � of eachgroup-by attribute is 5 and the domain size � of eachskyline attribute is 1,000.

In our study, we mainly compare our approach (ES-cube)

with two baseline approaches:

. Relational (REL). This method evaluates a queryusing a relational engine that supports group-byskyline queries [22]. It does not exploit any materi-alized cuboids.

. Projection data cube (PROJ). This method does notdistinguish between grouping dimensions and sky-line dimensions and follows the method in [15] to

select the cuboids. The cuboid selection algorithmstrictly follows the linear cost model. As a result, itconsiders only the I/O costs by setting T CP of (3) tozero. Also, in order to find skyline, each cellmaterializes the set of tuples belonging to that cell(only the projected values of the selected dimensionsare stored), instead of an aggregated value.

5.2 Experimental Results on Real Data

5.2.1 Comparison with Baseline Approaches

We first compare the performance of ES with the baseline

methods REL and PROJ on real data sets. Fig. 11 shows the

query I/O cost, CPU cost, and construction time of the

methods on the real data sets under different budget-to-data

ratio. Observe that the query cost of REL remains constant as

it does not materialize any cuboids. The query I/O costs

offered by PROJ and ES gradually reduce when the budget-

to-data ratio increases. This is consistent with the behavior of

the greedy materialization on traditional data cubes [15]—the

early selected cuboids provide more significant query cost

savings. Given the same budget, ES is able to materialize

more cuboids than PROJ; thus, ES offers better query I/O cost

saving than PROJ. In addition to the reduction of I/O, ES can

dramatically reduce the CPU cost when compared with PROJ

and REL, even for cases with tight space budget. Note that

PROJ does not reduce any comparisons because the cuboid

materialized by PROJ contains all the tuples (but with fewer

attributes). The construction time rises slowly when the

budget-to-data ratio increases.

5.2.2 Comparison with Other Cost Models

In order to evaluate the proposed cost model, we compare

the performance of ES with two alternate cost models:


Fig. 11. Performance versus budget-to-data ratio, on real data sets. (a) NBA. (b) BB. (c) FC.

4. http://archive.ics.uci.edu/ml/.


. ES-cube with optimal cost model (OPT). Thismethod is a variant of ES. It differs from ES in thefollowing: 1) instead of estimating the cuboid size, itreally materializes every cuboid ahead in order toknow their actual sizes, and 2) it utilizes the actualvalues obtained from the materialized cuboids toconstruct an optimal model for cost estimation. Notethat this method is impractical due to its very highconstruction time and tremendous storage space forstoring all cuboids. It merely serves as an indicatorof the lowest query cost that can be achieved with aperfect cost model.

. ES-cube with random strategy (RAN). This methodis a variant of ES, except that each time a randomcuboid is chosen to be materialized.

Fig. 12 shows the query I/O cost, CPU cost, andconstruction time of these methods on real data sets, as afunction of the budget-to-data ratio. We can see that thequery performance offered by ES is much better than RANand it stays very close to OPT. Furthermore, the construc-tion time of ES is significantly lower than that of OPT.Because of that, we will simply ignore RAN and OPT in theremaining experiments.

5.3 Experimental Results on Synthetic Data

In the following, we study the effects of various parameters

on the performance of the methods.

5.3.1 Varying the Group-By Attribute Domain Size �

Fig. 13a investigates the impact of the group-by attributedomain size � on the construction time and query costs ofthe three methods. Each square-bracketed number (alongthe x-axis) represents the average number of measuredgroups per query in the workload. When � is small, the size

of an ES-cuboid becomes small. Thus, more ES-cuboids

could be constructed from the fixed space budget. During

query processing, small ES-cuboids are accessed for

answering queries, so ES has small query I/O cost. Since

the size of a PROJ-cuboid is independent of �, it has

constant construction time and query I/O cost. At a small �

value, the number of tuples per group in a PROJ-cuboid is

high, causing PROJ to have high query CPU time. However,


Fig. 12. Comparison of cost models, on real data sets. (a) NBA. (b) BB. (c) FC.

Fig. 13. Effect of attribute domain size. (a) Versus group-by domain �.(b) Versus skyline domain size �.


the CPU cost trend of ES is slightly different from that ofPROJ. Even at a small � value, the number of tuples pergroup in an ES-cuboid remains small, and thus the CPUcost of ES is not high.

5.3.2 Varying the Skyline Attribute Domain Size �

We then study the effect of the skyline attribute domain size� on the construction time and query cost of all methods(see Fig. 13b). When � is small, a tuple t has high probabilityof belonging to the extended skyline of its group. As aresult, the size of each materialized cuboid is large, so ESmaterializes few cuboids within the given budget. There-fore, ES incurs more query I/O cost at a low � value. On theother hand, all methods have lower CPU cost at a low �value, as it becomes easier to obtain a tuple with low valuesin skyline attributes, which helps pruning other unqualifiedtuples effectively.

5.3.3 Varying the Data Size N

Fig. 14 shows the construction time and query costs of REL,PROJ, and ES with respect to the data size. When N rises,the extended skyline size of a group increases sublinearlywith N . Thus, the query cost of ES grows sublinearly. Theperformance gap between ES and the baseline methodswidens when N increases.

5.3.4 Varying the Number of Query Attributes

Our default workload contains queries with random 1-5group-by attributes and 1-5 skyline attributes. We nowstudy the effect of group-by and skyline attributes on thequery cost. ES outperforms REL and PROJ in all cases.Fig. 15a shows the query cost when we control thenumber of group-by attributes in queries. Observe that ESsaves significant I/O cost and CPU cost over REL andPROJ in all cases. When a query has a higher number ofgroup-by attributes, ES answers it with a larger cuboid,thus incurring higher I/O cost. However, when there aremore group-by attributes, the number of groups increasesand thus there are fewer tuples (and also skyline tuples)per group, so the CPU cost decreases.

Fig. 15b shows the query cost when we control thenumber of skyline attributes in queries. ES also outperformsREL and PROJ in all cases. When a query contains moreskyline attributes, ES answers it with a larger cuboid, thusincurring higher I/O cost and CPU cost. Since the size of

skyline is OððlnnÞd�1=ðd� 1Þ!Þ (on a uniform data set with ntuples and d attributes) [11], it explains the increasing trendof the query costs of all methods.

6 CONCLUSIONS AND FUTURE WORK

This paper studies the support of group-by skyline queriesin data warehouses by proposing the concept of group-byskyline cube. To implement this concept, we have devel-oped techniques for addressing the following importantissues:

1. the concept of ES-cuboids for re-enabling the sharingof computational efforts between cuboids,

2. an algorithm for answering a query from a materi-alized cuboid efficiently,

3. cost models for facilitating the selection of cuboids tobe materialized for the best query performanceimprovement, and

4. an efficient algorithm for constructing cuboids.

Experimental results show that the proposal techniquessignificantly reduce the query costs in terms of both CPUtime and I/O time.

Our future work includes incremental maintenance ofskyline cube. We also plan to study the integration of otheruseful decision-making operators such as top-k into datawarehouses as “aggregate functions.” Furthermore,throughout this paper, we assume that smaller values arepreferable over larger ones, for skyline attributes. We plan toremove this limitation by defining the concept of generic ES-cuboid, which consists of the union of ES-cuboids obtainedfrom different combinations of preferences on skylineattributes. Such a generic ES-cuboid can then be used toanswer a query irrespective of its preferences on attributes.An important issue would be to study an efficient methodfor estimating the size of a generic ES-cuboid, withouthaving to enumerate all combinations of preferences.

ACKNOWLEDGMENTS

The research was supported by grant PolyU 525009E fromHong Kong RGC.


Fig. 14. Effect of the data size.

Fig. 15. Effect of the number of query attributes. (a) Versus group-byattributes. (b) Versus skyline attributes.


REFERENCES

[1] W.T. Balke, U. Guntzer, and J.X. Zheng, “Efficient DistributedSkylining for Web Information Systems,” Proc. Int’l Conf. Extend-ing Database Technology (EDBT), 2004.

[2] I. Bartolini, P. Ciaccia, and M. Patella, “Efficient Sort-BasedSkyline Evaluation,” ACM Trans. Database Systems, vol. 33, no. 4,2008.

[3] K.S. Beyer et al., “Bottom-Up Computation of Sparse and IcebergCUBEs,” Proc. ACM SIGMOD Int’l Conf. Management of Data, 1999.

[4] S. Borzsonyi et al., “The Skyline Operator,” Proc. Int’l Conf. DataEng. (ICDE), 2001.

[5] C.-Y. Chan, H. Jagadish, K.-L. Tan, A. Tung, and Z. Zhang,“Finding k-Dominant Skylines in High Dimensional Space,” Proc.ACM SIGMOD Int’l Conf. Management of Data, 2006.

[6] C.-Y. Chan, H. Jagadish, K.-L. Tan, A. Tung, and Z. Zhang, “OnHigh Dimensional Skylines,” Proc. Int’l Conf. Extending DatabaseTechnology (EDBT), 2006.

[7] S. Chaudhuri et al., “Robust Cardinality and Cost Estimation forSkyline Operator,” Proc. Int’l Conf. Data Eng. (ICDE), 2006.

[8] J. Chomicki et al., “Skyline with Presorting,” Proc. Int’l Conf. DataEng. (ICDE), 2003.

[9] B. Cui, H. Lu, Q. Xu, L. Chen, Y. Dai, and Y. Zhou, “ParallelDistributed Processing of Constrained Skyline Queries by Filter-ing,” Proc. IEEE Int’l Conf. Data Eng. (ICDE), 2008.

[10] E. Dellis et al., “Efficient Computation of Reverse SkylineQueries,” Proc. Int’l Conf. Very Large Data Bases (VLDB), 2007.

[11] P. Godfrey, “Skyline Cardinality for Relational Processing,” Proc.Int’l Symp. Foundations of Information and Knowledge Systems(FoIKS), 2004.

[12] P. Godfrey, R. Shipley, and J. Gryz, “Algorithms and Analysis forMaximal Vector Computation,” VLDB J., vol. 16, no. 1, pp. 5-28,2007.

[13] J. Gray et al., “Data Cube: A Relational Aggregation OperatorGeneralizing Group-By, Cross-Tab, and Sub-Total,” Proc. Int’lConf. Data Eng. (ICDE), 1996.

[14] A. Gupta and I.S. Mumick, Materialized Views: Techniques,Implementations, and Applications. MIT Press, 1999.

[15] V. Harinarayan et al., “Implementing Data Cubes Efficiently,”Proc. ACM SIGMOD Int’l Conf. Management of Data, 1996.

[16] W. Jin, A.K.H. Tung, M. Ester, and J. Han, “On EfficientProcessing of Subspace Skyline Queries on High DimensionalData,” Proc. Int’l Conf. Scientific and Statistical Database Management(SSDBM), 2007.

[17] D. Kossmann, F. Ramsak, and S. Rost, “Shooting Stars in the Sky:An Online Algorithm for Skyline Queries,” Proc. Int’l Conf. VeryLarge Data Bases (VLDB), 2002.

[18] A.R. Krommer and C.W. Ueberhuber, Numerical Integration onAdvanced Computer Systems. Springer, 1994.

[19] H.T. Kung et al., “On Finding the Maxima of a Set of Vectors,”J. ACM, vol. 22, no. 4, pp. 469-476, 1975.

[20] C. Li, B.C. Ooi, A. Tung, and S. Wang, “DADA: A Data Cube forDominant Relationship Analysis,” Proc. ACM SIGMOD Int’l Conf.Management of Data, 2006.

[21] X. Lin, Y. Yuan, Q. Zhang, and Y. Zhang, “Selecting Stars: The kMost Representative Skyline Operator,” Proc. Int’l Conf. Data Eng.(ICDE), 2007.

[22] M.-H. Luk et al., “Group-By Skyline Query Processing inRelational Engines,” Proc. ACM Conf. Information and KnowledgeManagement (CIKM), 2009.

[23] D. Papadias et al., “Progressive Skyline Computation in DatabaseSystems,” ACM Trans. Database Systems, vol. 30, no. 1, pp. 41-82,2005.

[24] J. Pei et al., “Catching the Best Views of Skyline: A SemanticApproach Based on Decisive Subspaces,” Proc. Int’l Conf. VeryLarge Data Bases (VLDB), 2005.

[25] J. Pei et al., “Towards Multidimensional Subspace SkylineAnalysis,” ACM Trans. Database Systems, vol. 31, no. 4, pp. 1335-1381, 2006.

[26] J. Pei et al., “Computing Compressed Multidimensional SkylineCubes Efficiently,” Proc. Int’l Conf. Data Eng. (ICDE), 2007.

[27] K.A. Ross et al., “Fast Computation of Sparse Datacubes,” Proc.Int’l Conf. Very Large Data Bases (VLDB), 1997.

[28] M. Sharifzadeh and C. Shahabi, “The Spatial Skyline Queries,”Proc. Int’l Conf. Very Large Data Bases (VLDB), 2006.

[29] Y. Sismanis et al., “Dwarf: Shrinking the PetaCube,” Proc. ACMSIGMOD Int’l Conf. Management of Data, 2002.

[30] K.-L. Tan, P.-K. Eng, and B.C. Ooi, “Efficient Progressive SkylineComputation,” Proc. Int’l Conf. Very Large Data Bases (VLDB), 2001.

[31] Y. Tao et al., “Efficient Skyline and Top-k Retrieval in Subspaces,”IEEE Trans. Knowledge and Data Eng., vol. 19, no. 8, pp. 1072-1088,Aug. 2007.

[32] Y. Tao, X. Xiao, and J. Pei, “SUBSKY: Efficient Computation ofSkylines in Subspaces,” Proc. Int’l Conf. Data Eng. (ICDE), 2006.

[33] A. Vlachou et al., “SKYPEER: Efficient Subspace Skyline Compu-tation over Distributed Data,” Proc. Int’l Conf. Data Eng. (ICDE),pp. 416-425, 2007.

[34] T. Xia et al., “Refreshing the Sky: The Compressed Skycube withEfficient Support for Frequent Updates,” Proc. ACM SIGMOD Int’lConf. Management of Data, pp. 491-502, 2006.

[35] D. Xin et al., “P-Cube: Answering Preference Queries in Multi-Dimensional Space,” Proc. Int’l Conf. Data Eng. (ICDE), 2008.

[36] Y. Yuan et al., “Efficient Computation of the Skyline Cube,” Proc.Int’l Conf. Very Large Data Bases (VLDB), 2005.

[37] S. Zhang et al., “Scalable Skyline Computation Using Object-BasedSpace Partitioning,” Proc. ACM SIGMOD Int’l Conf. Management ofData, 2009.

[38] Z. Zhang et al., “Kernel-Based Skyline Cardinality Estimation,”Proc. ACM SIGMOD Int’l Conf. Management of Data, 2009.

Man Lung Yiu received the bachelor’s degree incomputer engineering and the PhD degree incomputer science from the University of HongKong in 2002 and 2006, respectively. Prior to hiscurrent post, he worked at Aalborg University forthree years starting in the Fall of 2006. He is nowan assistant professor in the Department ofComputing, Hong Kong Polytechnic University.His research focuses on the management ofcomplex data, in particular query processing

topics on spatiotemporal data and multidimensional data.

Eric Lo received the PhD degree in 2007 fromETH Zurich. He is currently an assistant profes-sor in the Department of Computing, Hong KongPolytechnic University. His research interestsinclude databases and software engineering.

Duncan Yung received the bachelor’s degree incomputer science in 2009 from the University ofHong Kong, Hong Kong. He is currently workingtoward the MPhil degree at the Department ofComputing, Hong Kong Polytechnic University,Hong Kong, under the supervision of Dr. Eric Lo.His research focuses on different kinds of spatialdata management and authentication.

. For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.