
Y. Manolopoulos and P. Návrat (Eds): ADBIS 2002, pp. 41-51, 2002.

MF-Retarget: Aggregate Awareness in Multiple Fact Table Schema Data Warehouses

Karin Becker, Duncan Dubugras Ruiz, and Kellyne Santos

Faculdade de Informática – Pontifícia Universidade Católica do Rio Grande do Sul
http://www.inf.pucrs.br/{~kbecker | ~duncan}

{kbecker, duncan}@inf.pucrs.br, [email protected]

Abstract. Performance is a critical issue in Data Warehouse systems (DWs), due to the large amounts of data manipulated, and the type of analysis performed. A common technique used to improve performance is the use of pre-computed aggregate data, but the use of aggregates must be transparent for DW users. In this work, we present MF-Retarget, a query retargeting mechanism that deals with both conventional star schemas and multiple fact table (MFT) schemas. This type of multidimensional schema is often used to implement a DW using distinct, but interrelated Data Marts. The paper presents the retargeting algorithm and initial performance tests.

1 Introduction

Data warehouses (DW) are analytical databases aimed at providing intuitive access to information useful for decision-making processes. A Data Mart (DM), often referred to as a subject-oriented DW, represents a subset of the DW, comprised of relevant data for a particular business function (e.g. marketing, sales). DW/DM handle large volumes of data, and they are often designed using a star schema, which contains relatively few tables and well-defined join paths. On-line Analytical Processing (OLAP) systems are the predominant front-end tools used in DW environments, which typically explore this multidimensional data structure [3, 13]. OLAP operations (e.g. drill down, roll up, slice and dice) typically result in SQL queries in which aggregation functions (e.g. SUM, COUNT) are applied to fact table attributes, using dimension table attributes as grouping columns (group by clause).

A multiple fact tables (MFT) schema is a variation of the star schema, in which there are several fact tables, necessary to represent unrelated facts, facts of different granularity, or even to improve performance [10]. A major use of MFT schemas is to implement a DW through a set of distributed subject-oriented DMs [8, 10], preferably related through a set of conformed dimensions [6], i.e. dimensions that have the same meaning at every possible fact table to which they can be joined. In such an architecture, a major responsibility of the central DW design team is to establish, publish and enforce the conformed dimensions. However, these efforts of the design team are not enough to guarantee that end users can easily combine facts coming from more than one DM. Indeed, the straightforward join of facts and dimensions in MFT schemas imposes a number of restrictions, which cannot always be observed; otherwise, one risks producing incorrect results. Most users do not have the technical skills to realise the subtleties involved and their implications in terms of query formulation. Therefore, for most users, queries involving MFT schemas are more easily handled through appropriate interfaces or specific applications that hide all the difficulties involved from them.

In this paper, we propose MF-Retarget, a query retargeting mechanism that handles MFT schemas and which is additionally aggregate aware. Indeed, pre-computed aggregation is one of the most efficient performance strategies to solve queries in DW environments [8]. The retargeting service provides users with transparency from both an aggregate retargeting perspective (aggregate unawareness) and a multiple fact table schema complexity perspective, freeing users from query formulation idiosyncrasies. The algorithm is generic and works properly regardless of the number of fact tables involved in the query.

The remainder of this paper is structured as follows. Section 2 presents related work on the use of aggregates. The retargeting algorithm is described in Section 3, and Section 4 presents some initial performance tests. Conclusions and future work are addressed in Section 5.

2 Related Work

2.1 Computation of Aggregates

In a DW, most frequently users are interested in some level of summarisation of the original data. One of the most efficient strategies for handling this problem is the use of pre-computed aggregates for the combination of dimensions/dimension attributes providing the greatest benefit for answering queries (e.g. frequent or expensive queries) [4, 8, 15, 16].

The computation of aggregates can be dynamic or static. In the former case, it is up to the OLAP tool or database engine to decide which aggregates are “beneficial”, a concept that varies from tool to tool. Works such as [1, 2, 4, 5, 7, 14] address dynamic computation of aggregates. These approaches differ from the static context in that not only the cost of executing the query is considered, but also maintenance/reorganisation costs, which are incurred as queries are processed [2, 5]. In the static context, aggregates are created off-line, and therefore maintenance/reorganisation costs are not that critical. It should be clear that dynamic and static aggregate computations are complementary mechanisms. The first addresses performance tuning from a technical perspective. The latter, addressed in this paper, is essential from a corporate point of view.

Organisationally, static aggregate computation is fundamental because aggregates are created based on corporate decisional requirements, prioritising types of analysis or types of users. Of course, decision support requirements vary over time, so it is fundamental that the DBA monitors the use of the analytical database in order to reassess the need for existing and/or new aggregates.

Design alternatives for representing aggregates are extensively discussed in pragmatic literature such as [6, 10]. Storing each aggregate in its own fact table presents many advantages in terms of ease of manipulation, maintenance, performance and storage requirements. Aggregation also leads to smaller representations of dimensions, commonly referred to as shrunken dimensions. Aggregates should, whenever possible, refer to shrunken dimensions instead of original dimensions. A shrunken dimension is commonly stored in a separate table with its own primary key.

User tools or applications should not reference the aggregate to be used in SQL queries. First, it must be possible to include/remove aggregates without affecting users or existing applications. Second, users should not be in charge of improving performance by selecting the appropriate aggregate.

2.2 Aggregate Retargeting Services

There are three major options for where a query-retargeting service can be located in the DW architecture: the desktop, the OLAP server or the database engine [16]. The query retargeting service can also be located in between these layers, in case no access to the DBMS engine/OLAP tool source code is provided. Most works in the literature (e.g. [1, 2, 3, 4, 5, 7, 14]) focus on dynamic computation of aggregates, considering strategies that are embedded in query processors, such that the retargeting service can completely change the query execution plan. Dynamic aggregation also considers a specific moment of user analysis (e.g. a sequence of related drills), and not the organisational requirements as a whole.

Kimball et al. [6] sketch a query-retargeting algorithm for statically pre-computed aggregates, which could be inserted as a layer between the front-end tool and the OLAP server/DBMS engine. The algorithm is based on the concept of a “family of schemas”, composed of one base fact table and all of its related aggregate tables. One of the advantages of this algorithm is that it requires very little metadata, basically the size of each fact table and the available attributes for each aggregate. In this paper, we extend this algorithm to deal with MFT schemas. Such an extension is useful for DW architectures implemented by a set of subject-oriented DMs, in which users wish to perform both separate and integrated analyses.

3 MF-Retarget

The striking feature of MF-Retarget is its ability to handle MFT schemas with the use of aggregates, while still providing total transparency to users. In the general case, joining several fact tables requires that each individual table first be summarised individually until all tables are at the same summarisation level (exactly the same dimensions), and only then joined. However, most users do not have the technical skills to realise the problems involved, nor the requirements in terms of query formulation. See [11] for a deeper discussion of the subtleties involved. Additionally, the mechanism should benefit from the use of aggregates as a query performance tuning mechanism. Hence, transparency in the context of MFT schemas must have a twofold meaning: a) aggregate unawareness, and b) MFT join complexity unawareness.


MF-Retarget is a retargeting service intended to lie between the front-end tool and the DBMS, which accepts as input a query written by a user through a proper user interface (e.g. a graphical one intended for naive users, a specific application). The algorithm assumes that:

− Users are unaware of MFT joining complexities, and always write a single query in terms of desired facts and dimensions. The retargeting service is responsible for rewriting the query to produce correct results, assuming as a premise that it is always necessary to bring each fact table to the same summarisation level before joining them.

− Users are unaware of the existence of aggregates, and always formulate the query in terms of the original base tables and dimensions. The retargeting service is responsible for rewriting the query in terms of aggregates, if possible.

− Retargeting queries involving a single fact table is a special case of MFT schemas, and therefore, the algorithm should provide good results in both cases.

The remainder of this section presents an illustration scenario and describes the algorithm. Further details on the algorithm, the required metadata and the MF-Retarget prototype can be obtained in [11].

3.1 Algorithm Illustration

To illustrate the functioning of the algorithm, let us consider the example depicted in Figure 1, a simplification of the MFT schema proposed in the APB-1 OLAP Council benchmark [9]. For each fact table, the fields prefixed by * compose its primary key. In dimension tables, only one field, the lowest one in the hierarchy, composes its primary key (also prefixed by *). The branches show the referential integrity from a fact or aggregate table to each of its dimensions. The MFT schema of Figure 1(a) shows two fact tables (Sales and Inventory), related by three conformed dimensions: Customer, TimeDim and Product. Sales has an additional dimension, namely Channel. Also, Figure 1(a) shows some possible aggregates for this schema. In the picture, grey boxes correspond to shrunken dimensions, i.e. hierarchic dimensions without one or more lower-level fields.

For example, consider a user who wishes to comparatively analyse quarterly Sales of product divisions, together with the corresponding status of the Inventory. This query cannot be answered simply by joining facts of distinct tables, because these facts represent information at different granularities, and therefore, they should be brought to the same summarisation level before they can be joined; otherwise inaccurate results will be produced. To free the user from the difficulties involved in MFT schemas, MF-Retarget assumes the user states a single query in terms of facts, dimensions and desired aggregation level (the input shown in Figure 2, for the example considered). The retargeting mechanism then has two goals: to correct the query, and to try to make it more efficient with the use of aggregates.

Considering the aggregates illustrated in Figure 1(a), the algorithm realises that Aggregate4 is the best candidate to answer the question, because it contains all necessary data, is the smallest one, and already joins the Sales and Inventory tables (in that order). In the absence of Aggregate4, Aggregate1 and Aggregate2 will be used. If the algorithm does not find any aggregate that can answer the query in a more efficient way, at least it transforms the query to produce correct results. Figure 3 shows the results of the algorithm for these three situations.

It should be pointed out that the best aggregate is not always the one that already joins distinct fact tables. Indeed, in case smaller individual aggregates exist, the cost of joining them can be smaller than the cost of summarising a much bigger joined pre-computed aggregate.

3.2 The Algorithm

The algorithm assumes that users have to inform only the tables (fact/dimensions), the grouping columns (which are the same ones listed in the select clause), the summarisation functions applied to the measurements, and possibly additional restrictions in the where clause. It considers the following restrictions on input queries: a) monoblock queries (select from where group by); b) only transitive aggregation functions are used; c) all dimensions listed in the from clause apply to all fact tables listed.

For the algorithm, the relationship between schemas is represented by a directed acyclic graph G(V, E). In the graph, V represents a set of star schemas, and E corresponds to a set of derivation relationships between any two schemas. The edges of E form derivation paths, meaning that the schema at the end of any path can be derived by aggregation of the schema at the start of that path. The use of graph structures for representing relationships between aggregates is well known [4, 13]. Figure 1(b) presents the derivation graph for the example of Figure 1(a). We assume that only transitive aggregation functions (i.e. SUM, MAX and MIN) are used in both the derivation relationships and the queries.
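As an illustration only, such a derivation graph can be kept as a plain adjacency list with a reachability helper. The sketch below is written in Python; the edge set is our own reading of Figure 1(b) (an assumption consistent with the candidate sets derived in Step 2), not metadata defined by the paper.

# Minimal sketch of the schema derivation graph G(V, E) as an adjacency list.
# The edges below are an assumed reading of Figure 1(b).
derivation_graph = {
    "Sales":      ["Aggregate1"],
    "Inventory":  ["Aggregate2"],
    "Aggregate1": ["Aggregate4"],
    "Aggregate2": ["Aggregate3", "Aggregate4"],
    "Aggregate3": [],
    "Aggregate4": [],
}

def derivable_from(graph, start):
    """Return every schema derivable (directly or transitively) from `start`."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for succ in graph.get(node, []):
            if succ not in seen:
                seen.add(succ)
                stack.append(succ)
    return seen

# derivable_from(derivation_graph, "Sales") -> {"Aggregate1", "Aggregate4"}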

Fig. 1. MFT schema and possible aggregates, and schema derivation graph

[Figure 1(a) depicts the MFT schema and its aggregates: fact tables Sales (*Cust_ID, *Prod_ID, *Chan_ID, *Time_ID, UnitsSold, DollarSales) and Inventory (*Cust_ID, *Prod_ID, *Time_ID, StockUnits); dimensions Customer (Retailer, *Store), TimeDim (Year, Quarter, *Month), Product (Division, Line, Family, Group, Class, *Code) and Channel (*Base); aggregates Aggregate1 (*Cust_ID, *Prod_ID, *Chan_ID, *Time_ID, UnitsSold, DollarSales), Aggregate2 (*Cust_ID, *Prod_ID, *Time_ID, StockUnits), Aggregate3 (*Prod_ID, *Time_ID, StockUnits) and Aggregate4 (*Prod_ID, *Time_ID, UnitsSold, DollarSales, StockUnits); and shrunken dimensions Sh_Customer (*Retailer), Sh_TimeDim (Year, *Quarter) and Sh_Product (Division, *Line). Figure 1(b) depicts the corresponding schema derivation graph.]


“The units sold and units in stock, per quarter and product division”
Select P.Division Division, T.Quarter Quarter, SUM(S.UnitsSold) UnitsSold, SUM(I.StockUnits) StockUnits
From TimeDim T, Product P, Sales S, Inventory I
Where T.Month=S.Time_ID and P.Code=S.Prod_ID and T.Month=I.Time_ID and P.Code=I.Prod_ID
Group by P.Division, T.Quarter

Fig. 2. Input SQL query from a naive DW user

a) considering the existence of Aggregate4:
Select P.Division Division, T.Quarter Quarter, SUM(UnitsSold) UnitsSold, SUM(StockUnits) StockUnits
From Sh_TimeDim T, Sh_Product P, Aggregate4 A4
Where T.Quarter=A4.Time_ID and P.Line=A4.Prod_ID
Group by P.Division, T.Quarter

b) in the absence of Aggregate4:
Create view V1 (Division, Quarter, UnitsSold) as
Select P.Division, T.Quarter, SUM(A1.UnitsSold)
From Sh_TimeDim T, Product P, Aggregate1 A1
Where T.Quarter=A1.Time_ID and P.Code=A1.Prod_ID
Group by P.Division, T.Quarter

Create view V2 (Division, Quarter, StockUnits) as
Select P.Division, T.Quarter, SUM(A2.StockUnits)
From Sh_TimeDim T, Product P, Aggregate2 A2
Where T.Quarter=A2.Time_ID and P.Code=A2.Prod_ID
Group by P.Division, T.Quarter

Select V1.Division Division, V1.Quarter Quarter, UnitsSold, StockUnits
From V1, V2
Where V1.Division = V2.Division and V1.Quarter = V2.Quarter

c) if no aggregates are found:
Create view V1 (Division, Quarter, UnitsSold) as
Select P.Division, T.Quarter, SUM(S.UnitsSold)
From TimeDim T, Product P, Sales S
Where T.Month=S.Time_ID and P.Code=S.Prod_ID
Group by P.Division, T.Quarter

Create view V2 (Division, Quarter, StockUnits) as
Select P.Division, T.Quarter, SUM(I.StockUnits)
From TimeDim T, Product P, Inventory I
Where T.Month=I.Time_ID and P.Code=I.Prod_ID
Group by P.Division, T.Quarter

Select V1.Division Division, V1.Quarter Quarter, UnitsSold, StockUnits
From V1, V2
Where V1.Division = V2.Division and V1.Quarter = V2.Quarter

Fig. 3. Possible outputs of the algorithm

The algorithm is divided into 4 steps, which for clarity purposes are individually presented and illustrated using the example of Section 3.1:

1. Divide the original query into component queries;
2. For each component query, select candidate schema(s) for answering the query;
3. Select the best candidates;
4. Rewrite the query.

Step 1: Division into Component Queries.
For each fact table Fi listed in the from clause of the original query Q, a component query Ci (i>0) is created, according to the following algorithm:

1. For each fact table Fi listed in the from clause of Q, create a component query Ci such that:
   1.1. Ci from clause := Fi and all dimensions listed in the from clause of Q;
   1.2. Ci where clause := all join conditions of Q necessary to relate Fi to the dimensions, together with any additional conditions involving these dimensions or Fi;
   1.3. Ci group by clause := all attributes used in the group by clause of Q;
   1.4. Ci select clause := all attributes used in the group by clause of Q, in addition to all aggregation function(s) applied to Fi attributes.
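To make Step 1 concrete, the sketch below shows one possible Python rendering; the dictionary-based query representation (from/where/group_by/aggregation lists) and the helper name are our own illustration assumptions, not the representation used by the MF-Retarget prototype.

# Sketch of Step 1: split the original query Q into one component query per fact table.
# `q` is a parsed query: {"from": [...], "where": [...], "group_by": [...], "aggregations": [...]}.
def split_into_component_queries(q, fact_tables):
    dims = [t for t in q["from"] if t not in fact_tables]
    components = []
    for fact in (t for t in q["from"] if t in fact_tables):
        components.append({
            "from": [fact] + dims,                                    # step 1.1
            "where": [c for c in q["where"]                           # step 1.2
                      if all(t in dims + [fact] for t in c["tables"])],
            "group_by": list(q["group_by"]),                          # step 1.3
            "select": list(q["group_by"])                             # step 1.4
                      + [a for a in q["aggregations"] if a["table"] == fact],
        })
    return components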


Notice that a query referring to a single fact table is treated as a special case of queries involving several fact tables. In that case, Step 1 produces a single component query that is equal to the original query Q. Figure 4 shows the component queries created for the query input illustrated in Figure 2.

C1 -> Select P.Division, T.Quarter, SUM(S.UnitsSold) UnitsSold
From TimeDim T, Product P, Sales S
Where T.Month=S.Time_ID and P.Code=S.Prod_ID
Group by P.Division, T.Quarter

C2 -> Select P.Division, T.Quarter, SUM(I.StockUnits) StockUnits
From TimeDim T, Product P, Inventory I
Where T.Month=I.Time_ID and P.Code=I.Prod_ID
Group by P.Division, T.Quarter

Fig. 4. Component queries for input of Figure 2

Step 2: Candidates for Component Queries.
This step generates, for each component query Ci (i>0) resulting from Step 1, the respective candidate set CSi. Each candidate belonging to CSi is a schema (base or aggregate) that answers Ci.

2. For each component query Ci generated in Step 1:
   2.1. Let n := the node that corresponds to the base schema of Ci; mark n as “visited”; let CSi := {n};
   2.2. Using a depth-first traversal, examine all schemas derived from n until all nodes that can be reached from it are marked as “visited”:
        2.2.1. Let n := the next node; mark n as “visited”;
        2.2.2. If all query attributes (select and where clauses) of Ci belong to schema n and each aggregation function of the select clause of Ci is exactly the same as the one used in the fact table of schema n
               Then CSi := CSi ∪ {n};
               Else mark all nodes that can be reached from n as “visited”.

Each time this step is executed for a component query, the graph is traversed using a depth-first algorithm starting from the corresponding base schema. When the algorithm detects that a schema cannot answer the component query, all schemas at the end of a derivation path starting from it are disregarded. Each CSi element is a valid candidate to replace Fi in the from clause of Q. Therefore, every tuple (e1, …, en), where e1 ∈ CS1, …, en ∈ CSn, is a valid combination for the rewrite of Q. Considering the graph depicted in Figure 1(b), and the component queries of Figure 4:

− CS1 = {Sales, Aggregate1, Aggregate4} for component query C1;
− CS2 = {Inventory, Aggregate2, Aggregate3, Aggregate4} for C2.
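The candidate-set construction of Step 2 can be sketched as the depth-first traversal below (a Python illustration under our own assumptions: `schema_meta` is hypothetical metadata listing, per schema, the available attributes and the aggregation functions materialised in it).

# Sketch of Step 2: collect every schema (base or aggregate) that answers a component query.
def candidate_set(component, base_schema, graph, schema_meta):
    needed_attrs = set(component["attributes"])   # attributes of the select and where clauses
    needed_funcs = set(component["functions"])    # e.g. {("SUM", "UnitsSold")}

    def answers(schema):
        meta = schema_meta[schema]
        return needed_attrs <= meta["attributes"] and needed_funcs <= meta["functions"]

    def reachable(node):
        seen, stack = set(), [node]
        while stack:
            n = stack.pop()
            for succ in graph.get(n, []):
                if succ not in seen:
                    seen.add(succ)
                    stack.append(succ)
        return seen

    candidates, visited = {base_schema}, {base_schema}
    stack = list(graph.get(base_schema, []))
    while stack:                                   # depth-first traversal (step 2.2)
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        if answers(node):                          # step 2.2.2
            candidates.add(node)
            stack.extend(graph.get(node, []))
        else:
            visited |= reachable(node)             # prune every schema derived from node
    return candidates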

Step 3: Selection of Best Candidates.
Let T be the set of tuples (e1, …, en), where e1 ∈ CS1, …, en ∈ CSn (n>0), representing the Cartesian product of the candidate sets CS1 × … × CSn. Let t be a tuple of T. The present version of the algorithm uses the concept of accumulated size to choose the best candidate. The accumulated size of t(e1, …, en), AS(t), is a function that returns the total number of records that must be handled if the query were rewritten using t. When summing the number of records, AS(t) counts the size of a given table only once, in case it is included in more than one candidate set CSi. Thus, if Aggregate4 is chosen, only its records need to be processed, and only once. In all other cases, records from the different fact tables in t are processed, considering each table only once.

This may suggest that, in a multi-fact query, the best t will always be the one where e1 = … = en. However, this is not true. Indeed, the cost of processing more records (I/O cost) has a stronger impact than the cost of joining tables. Notice that this step can be improved in many ways, by varying the cost function used to prioritise the candidates for query rewrite. An immediate improvement of this function is the consideration of index information combined with table size.

3. Consider all CSi sets generated in Step 2 and T, the Cartesian product of the candidate sets CS1 × … × CSn:
   3.1. t' := the tuple t(e1, …, en) ∈ T with the smallest accumulated size AS(t), considering all t(e1, …, en) ∈ T, where e1 ∈ CS1, …, en ∈ CSn.
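A minimal sketch of this selection in Python, assuming the candidate sets of Section 3.1; the base table sizes follow Test1, while the aggregate sizes are purely illustrative assumptions (the paper does not report them).

from itertools import product

# Sketch of Step 3: pick the candidate tuple with the smallest accumulated size AS(t).
def accumulated_size(t, table_sizes):
    # Each distinct table is counted only once, even if it answers several component queries.
    return sum(table_sizes[table] for table in set(t))

def best_candidate(candidate_sets, table_sizes):
    all_tuples = product(*candidate_sets)          # Cartesian product CS1 x ... x CSn
    return min(all_tuples, key=lambda t: accumulated_size(t, table_sizes))

sizes = {"Sales": 1_239_300, "Inventory": 270_000,        # Test1 base table sizes
         "Aggregate1": 120_000, "Aggregate2": 60_000,     # aggregate sizes: assumed for illustration
         "Aggregate3": 20_000, "Aggregate4": 90_000}
cs1 = {"Sales", "Aggregate1", "Aggregate4"}
cs2 = {"Inventory", "Aggregate2", "Aggregate3", "Aggregate4"}
print(best_candidate([cs1, cs2], sizes))           # -> ('Aggregate4', 'Aggregate4')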

Step 4: Query Reformulation.
Once the best candidate for each component query is determined, the query is rewritten. If the set of best candidates resulting from Step 3 has a single element, i.e. a common aggregate for all component queries, a single query is written using that aggregate and the respective shrunken dimensions. This is the case for our example, where Figure 3(a) displays the rewritten query. Otherwise, the query is rewritten in terms of views that summarise the best aggregates individually, and then join them (e.g. as in Figure 3(b) and (c)). If there is a common best candidate that answers more than one component query, but not all of them, a single view is created for that set of component queries. This trivial algorithm is not presented due to space limitations.
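Although the paper omits this step, its decision logic can be sketched as follows (a hedged Python illustration; the emit_* callables stand in for the SQL generation exemplified in Figure 3 and are hypothetical names).

# Sketch of Step 4: choose between one rewritten query and per-group views plus a join.
def reformulate(best, emit_single_query, emit_view, emit_join_of_views):
    # `best` maps each component query to the schema chosen for it in Step 3.
    groups = {}
    for component, schema in best.items():
        groups.setdefault(schema, []).append(component)

    if len(groups) == 1:
        # A common aggregate answers every component query: a single query suffices (Fig. 3a).
        return emit_single_query(next(iter(groups)))

    # Otherwise: one view per group of component queries, then join the views (Fig. 3b/3c).
    views = [emit_view(schema, comps) for schema, comps in groups.items()]
    return emit_join_of_views(views)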

4 Tests

Initial performance tests were performed based on the MFT schema presented in the APB-1 OLAP Benchmark [9], which comprises 4 fact tables. For the tests, we used Inventory, Sales and the corresponding dimensions, as depicted in Figure 1(a). The aggregates were defined to experiment with performance under different levels of aggregation (compression factor). We did not use the semantics of the aggregates, nor the user requirements expressed in the benchmark (e.g. queries). We also disregarded the number of records of the resulting database for aggregate selection. Two tests were executed, referred to as Test1 and Test2, described in the remainder of this section.

4.1 Test1

The goal of Test1 was to verify whether the algorithm performed well and correctly, considering both star and MFT schemas. The APB-1 program was executed with parameters 10, 0.1 and 10, which resulted in a Sales table with 1,239,300 records, whereas Inventory comprised 270,000 records. We executed five queries, three of them involving a single fact table (Sales), and two of them the join of both fact tables. For each query, we calculated the number of records of the resulting table, and the processing time considering all possible alternatives.


Fig. 5. Derivation graph used in Test2

[Figure 5 shows the derivation graph used in Test2: base fact tables Inventory and Sales, individual aggregates I1, I3, S1 and S3, and the joined aggregate I2S2.]

It was possible to verify that the algorithm always chose the aggregate with the smallest processing time, regardless of whether the query involved a single fact table or a join of multiple fact tables. Proportionally, there was a significant gain in the vast majority of cases, but absolute gains were not always significant. The magnitude of the performance gain seems to be a function of the (fact) table size, the aggregate compression factor, and the output table size.

4.2 Test2

Test2 was executed running APB-1 with parameters 10, 1 and 10. The derivation graph for this test is depicted in Figure 5, and Table 1 describes the properties of the schemas: number of records, compression factor (CF) with regard to both the base fact table and its deriving schema(s), and the difference between the derived/deriving schemas in terms of dimensions.

We executed a single query that involved facts from both the Sales and Inventory tables, and which could be answered by (the combination of) all aggregates. The goal was to compare the respective absolute processing times. Table 2 displays in the first row the elapsed time measured for each execution. Subsequent rows show the gains in using an aggregate with regard to bigger alternatives. For instance, the use of aggregate I2S2 represents a gain of 52% with respect to the join of I1 and S1, calculated as (time(I1 and S1) – time(I2S2))/time(I1 and S1), and 96% with regard to the join of Inventory and Sales. It is possible to verify that the gains considering absolute times are very significant this time. The use of the accumulated size AS(t) as the main criterion to prioritise aggregate candidates seems to be simple but efficient, although it can still be improved in many ways, particularly through indexing.
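As a quick check of the gain formula against the elapsed times reported in Table 2 (a small Python sketch; the times are those of the table):

# gain = (time(bigger alternative) - time(aggregate)) / time(bigger alternative)
def to_seconds(hms):
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

t_base = to_seconds("20:35:43")   # Inventory joined with Sales
t_i1s1 = to_seconds("1:27:04")    # I1 joined with S1
t_i2s2 = to_seconds("0:42:09")    # I2S2

print(f"{100 * (t_i1s1 - t_i2s2) / t_i1s1:.1f}%")   # ~51.6%: the 52% gain over I1 and S1
print(f"{100 * (t_base - t_i2s2) / t_base:.1f}%")   # ~96.6%: the 96% gain over Inventory and Sales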

5 Conclusions

In this work we presented MF-Retarget, a retargeting mechanism that deals with both MFT schemas and statically computed aggregates. The algorithm provides two types of transparency:

a) aggregate unawareness, and
b) users are spared from the complexities of queries in MFT schemas.

This retargeting service is intended to be implemented as a layer between the user front-end tool and the DBMS engine. Thus, it can be complementary to the gains already provided by OLAP tools/DBMS engines in the context of dynamic computation of aggregates. Further details on the implementation can be obtained in [11].


Tab. 1. Inventory/Sales derivation graph description

Table or Aggregate | Records    | CF (base schema)       | Deriving schema | CF (deriving schema)    | Shrunken dims. | Eliminated dims.
Sales (S)          | 13,122,000 |                        |                 |                         |                |
S1                 | 2,400,948  | 18.3 %                 | Sales           | 18.3 %                  | Yes            | Yes
S3                 | 614,520    | 4.7 %                  | S1              | 25.6 %                  | Yes            |
Inventory (I)      | 12,396,150 |                        |                 |                         |                |
I1                 | 2,496,762  | 20.1 %                 | Inventory       | 20.1 %                  | Yes            |
I3                 | 631,800    | 5.1 %                  | I1              | 25.3 %                  | Yes            |
I2S2               | 2,400,948  | 18.3 % (S), 19.4 % (I) | S1 and I1       | 96.2 % (I1), 100 % (S1) |                |

Tab. 2. Results with larger tables

Query1: 614,520 records       | Inventory and Sales | I1 and S1 | I2S2      | I3 and S3
Time (hours:min:sec)          | 20:35:43            | 1:27:04   | 0:42:09   | 0:15:22
Gain over Inventory/Sales (%) |                     | 93        | 96        | 99
Gain over I1 and S1 (%)       |                     |           | 52        | 82
Gain over I2S2 (%)            |                     |           |           | 64
AS(t)                         | 25,518,150          | 4,897,710 | 2,400,948 | 1,246,320

Preliminary tests confirmed that the algorithm always provided the best response time. Proportional gains are always significant, but absolute gains increase with bigger fact tables. It is obvious that additional tests are required to determine precise directives for the construction of aggregates in MFT schemas, and under which circumstances the processing gains are significant. It is also important to refine our criteria for selecting the best candidate, and to use indexing information in addition to the number of table records for aggregate selection.

Future work also includes, among other topics, a better definition of cost functions for prioritising candidate aggregates, the use of indexes in the cost function, the integration of the retargeting mechanism into a DW architecture, support for aggregate monitoring and recommendations for aggregate reorganisation, and the use of the proposed algorithm in the context of dynamic aggregate computation.

Acknowledgements

This work was partially financed by FAPERGS, Brazil.

References

1. Baralis, E., Paraboshi, S., Teniente, E. Materialized Views Selection in a Multidimensional Database. Proceedings of the VLDB'97 (1997). 156-165.

2. Chaudhuri, S., Shim, K. An Overview of Cost-Based Optimization of Queries with Aggregates. Bulletin of TCDE (IEEE), v. 18, n. 3 (1995). 3-9.


3. Gray, J., Chaudhuri, S. et al. Data cube: a relational aggregation operator generalizing Group-by, Cross-tab and Subtotals. Data Mining and Knowledge Discovery, v. 1, n. 1 (1997). 29-53.

4. Gupta, A., Harinarayan, V., Quass, D. Aggregate-query Processing in Data Warehousing Environments. In: Proceedings of the VLDB'95 (1995). 358-369.

5. Gupta, H., Mumick, I. Selection of views to materialize under a maintenance cost constraint. Proceedings of the ICDT (1999). 453-470.

6. Kimball, R. et al. The Data Warehouse Lifecycle Toolkit: expert methods for designing, developing, and deploying data warehouses. John Wiley & Sons (1998).

7. Kotodis, Y., Roussopoulos, N. Dynamat: A Dynamic View Management System for Data Warehouse. Proceedings of the ACM SIGMOD 1999 (1999). 371-382.

8. Meredith, M., Khader, A. Divide and Aggregate: designing large warehouses. Database Programming & Design, June (1996). 24-30.

9. OLAP Council. APB-1 OLAP Benchmark (Release II). Online. Captured in May 2000. Available at: http://www.olapcouncil.org/research/bmarkco.htm, Nov. 1998.

10. Poe, V., Klauer, P., Brobst, S. Building a Data Warehouse for Decision Support, 2nd edition. Prentice Hall (1998).

11. Santos, K. MF-Retarget: a multiple-fact table aggregate retargeting mechanism. MSc. Dissertation, Faculdade de Informatica - PUCRS, Brazil (2000). (in Portuguese)

12. Sapia, C. On Modeling and Predicting Query Behavior in OLAP Systems. Proceedings of the DMDW'99 (1999).

13. Sarawagi, S., Agrawal, R., Gupta, A. On Computing the Data Cube. IBM Almaden Research Center: Technical Report, San Jose, CA (1996).

14. Srivasta, D., et al. Answering Queries with Aggregation Using Views. Proceedings of the VLDB'96 (1996). 318-329.

15. Wekind, H., et al. Preaggregation in Multidimensional Data Warehouse Environments. Proceedings of the ISDSS'97 (1997). 581-59.

16. Winter, R. Be Aggregate Aware. Intelligent Enterprise Magazine, v. 2, n. 13 (1999).