
Distributed and Parallel Databases, 2, 371-404 (1994) © 1994 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Page-Query Compaction of Secondary Memory Auxiliary Databases NABIL KAMEL

Database Systems Research and Development Center, Computer and Information Sciences Department, University of Florida, Gainesville, FL 32611

Received July 20, 1992; Revised April 26, 1994

Recommended by: Ramez Elmasri

Abstract. Prestoring redundant data in secondary memory auxiliary databases is an idea that can often yield improved retrieval performance through better clustering of related data. The clusters can be based either on whole query results or, as this paper shows, on more specialized units called page-queries. The deliberate redundancy introduced by the designer is typically accompanied by much unnecessary redundancy among the elements of the auxiliary database. This paper presents algorithms for efficiently removing unwanted redundancy in auxiliary databases organized into page-query units. The algorithms presented here extend prior work on secondary memory compaction in two respects. First, since it is generally not possible to remove all unwanted redundancies, the paper shows how the compaction can be done so as to remove the most undesirable redundancy from a system performance point of view. For example, among the factors considered in determining the worst redundancies are the update behavior and the effects of a particular compaction scheme on memory utilization. Second, unlike traditional approaches to database compaction, which aim merely at reducing the storage space, this paper considers the paging characteristics in deciding on an optimal compaction scheme. This is done through the use of page-queries. Simulation results are presented and indicate that page-query compaction yields lower storage requirements and greater time savings than standard non-page-query compaction.

Keywords: Page query, databases, caching, secondary storage, performance, query optimization

1. Introduction

Modern applications of databases place increasing demands on performance. This fact, coupled with rapidly increasing memory capacities (both main and secondary), has motivated a number of researchers to investigate the use of large memories to speed query processing. Most of the work in this area seems to have concentrated on developing theories for querying and updating fragments of relations which are usually assumed to be main-memory resident.

Let us call the main body of this redundant data the Auxiliary Data Base, the ADB. If this data is main memory resident, then it is called the Main Memory Auxiliary Data Base, the MADB. If, on the other hand, the auxiliary data base is placed on the disk, then it is called the Secondary Memory Auxiliary Data Base, the SADB. In the most general case, however, the ADB will contain both a MADB component and a SADB component with the interface being invisible to the user. Usually, the ADB consists of materialized results of query expressions. Those expressions could represent view definitions, alerter predicates, triggering conditions, the left hand side expressions of rules in a rule-based system, or any such query expressions which are known at compile time.


372 KAMEL

Since those query expressions are independently defined and chosen, overlaps among their result tuples may arise randomly. These overlaps represent unintentional redundancies which serve no useful purpose. Not only are these overlaps useless, but they are potentially very harmful because they result in wasted storage space, they reduce the effective capacity of the ADB (and hence its effectiveness in optimizing retrieval performance), and they cause degradation in the update performance of the system. This degradation results from the need to update multiple copies of the same data item in the ADB. The harmful effects of this unintentional redundancy with respect to update performance are particularly well pronounced in secondary memory auxiliary databases because of the longer access times involved.

Much of this harm can be greatly reduced by carefully compacting the data. The compaction process must take into account the effects of anticipated update patterns on the resulting performance when choosing a particular compaction scheme over another. This paper presents methods and algorithms to perform optimal compactions in a SADB subject to update traffic. This work differs from prior work in two respects: first, in the incorporation of update information in determining an optimal compaction scheme, and second, in its use of special constructs called page-queries to organize the compacted data.

A page-query is a small data packet consisting roughly of the result of processing one query predicate out of one memory page. Using page-queries as the primary units of information retention in the SADB allows partial query results to be stored in the SADB. Furthermore, each page-query will result in saving one memory page access during retrieval time with no duplication of contributions among different page-queries. Using page-queries in the SADB results in an increased discrimination granularity in determining the population of the SADB. Just as with whole query results, arbitrarily defined page-queries overlap randomly and require further compaction.

The rest of this paper is organized as follows: The next section elaborates on prior research in this area. Section 3 presents the notation used and provides an overview of the organization and operation of the SADB, the target of the compaction process. Section 4 presents formal definitions of the notion of page-queries. Section 6 presents the problem of compacting the SADB and provides the algorithms used to solve it. This is done for two cases: 1) if the SADB is organized as a collection of whole queries (subsections 6.2 and 6.3) and 2) if the SADB is built entirely from page-queries (subsection 6.4). Section 7 presents some simulation results and Section 8 ends with concluding remarks.

2. Related work

The work in this paper stems from the work done on redundancy and the need to represent redundant data as efficiently as possible. Work in redundancy has been done in the contexts of fragments of relations and view materialization.

Fragments of relations: The data in the SADB can be viewed as special types of relational fragments. Thus, this work is also related to work done on fragments of relations. Some algorithms for translating queries made against the base relations into equivalent queries made against the fragments (and vice versa) have been proposed in the literature. Rosenkrantz and Hunt [13] give a digraph representation of a special class of predicates and


give polynomial time algorithms for the satisfiability, equivalence, and reduction problems when the ≠ sign is not permitted. Sun and Kamel [14] extend these results by presenting an efficient solution to the implication problem without first converting it to the satisfiability problem. Solving the implication problem for queries is central to querying fragments of relations. The work in [9] makes use of some of these results by showing how they can be used to facilitate querying the SADB. The work done by Maier and Ullman in [12] also deserves mention here, as they address a number of interesting questions about relations constructed by the union and selection operations from other relations. In [11], the authors give an interesting approach to the questions of computability and derivability of queries from fragments constructed by arbitrary relational algebra expressions.

View materialization: Other work related to the question of redundancy has also been done in the context of view materialization. The work in [1] represents an approach for efficient update of materialized views when the base relations are subject to updates. This approach, however, would only be suitable for the MADB portion of our ADB but not for the SADB. In our case, this is due to the transitory nature of the SADB which is the primary difference between the SADB and a materialized view. The approach adopted in this paper for the SADB is to detect irrelevant and autonomously computable updates as defined by Blakeley, Larson and Tompa in [2] where methods for detecting them are given. These updates can be carried out immediately in the SADB without much overhead. All other updates are deferred until their affected page-queries are requested. The main database is maintained separately and can be assumed to be always consistent. The formal model developed in this paper, however, is capable of modeling all types of updates. None of the work surveyed above, however, directly addresses the issues related to compacting the redundant data (whether fragments or materialized views) explicitly.

There are two situations for compaction which are addressed in this paper: compacting a SADB which contains whole queries and one which is organized as a collection of page-queries.

Compacting whole queries: The problem of compacting whole queries for disk residency was first discussed by Kou in [10], where he identified the major problems without giving specific solutions. Ghosh and Gupta ([4] and [5]) have developed many theoretical results concerning the properties of consecutively retrievable data organizations, but without referring explicitly to the question of updates. The same work was extended by Deogun, Raghavan, and Tsou ([3]) in a database context by addressing the issue when no nesting is allowed between the queries and without updates. They show the importance of the class of problems for which they have developed their algorithms. The storage model for all this work is invariably that of a linear storage space; the same model is used in this paper.

Compacting page-query databases: The page-query concept was first introduced in [7] and further developed in [8] for artificial intelligence applications using large main memories, and in [9] for secondary memory auxiliary databases. The current paper extends this work further by showing how to compact auxiliary databases organized as collections of arbitrarily overlapping page-queries and in the presence of updates.


[Figure 1 depicts the five components described below: incoming queries reach the query processor, which consults the page-query directory and routes work to the main database, the SADB, and the MADB before returning the query answer.]

Figure 1. System overview.

3. Notation and redundancy system overview

This section gives a brief overview of the underlying redundancy system which is the target of the current compaction study. As shown in Figure 1, there are five major components of the redundancy system:

(1) The main database, where the bulk of the disk-resident data is located.

(2) The Main-memory Auxiliary Database (the MADB) which contains the most frequently requested set of page-queries. The data in this small database is derived from the main database.

(3) The Secondary-memory Auxiliary Database (the SADB) which also contains a set of derived page-queries. This database holds those page-queries which are less frequently referenced than those in the MADB. The SADB can be much larger than the MADB and is the target of the current compaction study.

(4) The query processor. In addition to performing the regular query processing functions, the query processor would also have to make routing decisions. The query may be directed to any one of the three available databases (the main database, the MADB, and the SADB) or it may be split into two or three subexpressions, each directed to a different database. In making these decisions, the query processor chooses the least costly alternative by cost estimation.


(5) The last component is the page-query directory. This directory is consulted by the query processor before making its query routing decisions. In the next three subsections, brief descriptions are given of the notation used, the assumed query model, the method used to handle updates, and how the SADB is used in query evaluation.

3.1. Query model, usage pattern model, disk model, and general notation

This section states the basic models used for the queries, the usage pattern, and the disk. A list of the general notation used throughout the paper can be found in the appendix.

The query model: The basis for our query model is the conjunctive unequals-free mixed predicate [13]. This is defined as a conjunction of comparisons employing the five comparison operators (=, <, >, ≤, ≥). Each comparison may be of any one of three types:

(1) A comparison with a constant such as A > 12

(2) A comparison between two variables such as A = B

(3) A comparison between two variables with an offset such as A < B + 9

Any relational algebra query involving selections, projections, and joins (PSJ-expressions) [11] can be written in the form:

(R1 × R2 × ··· × Rk){C}[A1, A2, ..., Az]

where R1, R2, ..., Rk are relations, C is a conjunctive unequals-free mixed predicate, and A1, A2, ..., Az are the final projection attributes. Join operations are encoded in C (also called the condition) in the form of two-attribute comparisons (types 2 and 3 above). Note that R1, R2, ..., Rk do not have to all be distinct. However, if the same relation appears twice, then it must be renamed and all its attributes must also be renamed.

The condition C can be viewed as a Boolean conjunction of variable comparisons of the three types shown above, as in predicate calculus, where each variable is either free or is implicitly existentially quantified. All free variables appear in the projection part of the query and no variable appearing in the projection part of the query is bound. Since any PSJ expression can be represented in this notation, the standard relational algebra operators SELECT, PROJECT, and JOIN will be used interchangeably.
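To make the query model concrete, the sketch below evaluates a conjunctive unequals-free mixed predicate against a single tuple. It is an illustration only; the tuple-as-dict encoding and the function names are assumptions, not constructs from the paper.

```python
import operator

# Comparison operators allowed by the model (note: no "not equal").
OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt,
       "<=": operator.le, ">=": operator.ge}

def holds(comparison, tup):
    """Evaluate one comparison against a tuple given as an attr -> value dict.

    comparison is (attr, op, operand, offset): operand is a constant (type 1)
    or another attribute name (type 2 with offset 0, type 3 otherwise).
    """
    attr, op, operand, offset = comparison
    rhs = tup[operand] + offset if isinstance(operand, str) else operand
    return OPS[op](tup[attr], rhs)

def satisfies(condition, tup):
    """A tuple satisfies the condition C iff every conjunct holds."""
    return all(holds(c, tup) for c in condition)

# C encodes: A > 12 AND A = B AND A < B + 9 (the three comparison types above).
C = [("A", ">", 12, 0), ("A", "=", "B", 0), ("A", "<", "B", 9)]
print(satisfies(C, {"A": 15, "B": 15}))  # True
print(satisfies(C, {"A": 10, "B": 10}))  # False: A > 12 fails
```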

The usage pattern model: Let us assume that at the beginning of each time interval, an estimation of the anticipated query patterns for the next time interval is made available. These patterns are assumed to be given in the form of a number of retrieval, deletion, and insertion queries. These anticipated queries are denoted by qxj, where x is a letter designator that takes one of three values, r, d, or i, to denote retrieval, deletion, or insertion queries respectively, and j is a sequentially assigned integer designating each anticipated query. The term retrieval, deletion, or insertion region will be occasionally used in reference to these queries. Associated with each anticipated query qxj is an anticipated frequency of reference fxj. Finally, nx is used to denote the total number of distinct anticipated queries


of type x. For example, nr refers to the total number of anticipated retrieval queries in the system.

The disk model: The disk model used here is that of a linear storage space. This means that the time to move the disk head from one page in a linear sequence of pages to another is proportional to the distance between the two points in the sequence. This is essentially the same model used by Kou in [10] in his study of consecutive retrieval problems. In this paper, the main effort is devoted to clustering strongly correlated SADB data in consecutive blocks. This is achieved by the selective introduction of controlled redundancy.

General notation: A list of the general notation used in this paper can be found in the appendix. Here some general guidelines on the use of notation are given. The notation used was made as concise as possible. Pointed brackets < > are used to enclose page-queries. The notation used inside the pointed brackets to denote page-queries consists of two components: 1) one or more page ids and 2) one or more query ids. These two components are separated by a hyphen. Thus, <pi - qrj> refers to a page-query whose components are pi and qrj. Similarly, <(pi1, pi2, ..., pin) - qrj> refers to a composite page-query (defined precisely in Section 4) whose components are all the pages pi1, pi2, ..., pin and the query qrj. Since this involves several pages but only one query, a plural 's' is added to the word 'page', naming this construct a pages-query. Likewise, the construct <pi - (qi1, qi2, ..., qim)> is named a page-queries. To avoid possible confusion between the last construct and a collection of page-queries, the paper always makes this distinction whenever the possibility of confusion arises. The term page-group is used to indicate the construct whose components are one page and a group of queries. A group is defined as a consecutively retrievable collection of queries with no redundancy. Gj is used to denote a group and <pi - Gj> to denote a page-group. The construct containing several pages and several queries is not useful in the optimization problems discussed and is therefore not named. Generally, small letters are used for page and query ids while capital letters are used for contents. Thus, pi refers to page number i while Pi refers to its contents, and the same holds for queries. Vertical bars around a quantity refer to its size. For example, |Qrj| refers to the size of the response set of query qrj. The next subsection discusses how updates are handled.

3.2. Handling updates

As mentioned earlier, the compaction algorithms developed in this paper take the update overhead into account when selecting a particular compaction scheme over another. To understand how this is done, it is necessary to understand how the proposed system handles updates. The term update overhead refers to the extra number of page accesses required to update a database with a SADB component over what would have been required had there been no SADB component. Let us first examine all the possible sources of this overhead and then classify the types of update operations according to the amount of overhead they incur in the presence of the SADB. The strategy used to handle updates is also indicated.

Sources of overhead: The SADB update overhead comes from two sources: 1) overhead required to recompute the contents of the SADB, and 2) overhead required to actually perform the update on the SADB. Let us consider each one of these in turn.


Overhead required to recompute the contents of the SADB: There are three types of updates with respect to a prestored SADB query:

(1) Irrelevant updates. These updates can never affect the contents of the prestored query.

(2) Autonomously computable updates. The effects of these updates on a prestored query can be determined without having to recompute the query.

(3) General updates. Determining the effects of these updates on the prestored query does in general require an incremental recomputation of the query. This recomputation, however, can be deferred until the query is requested again. The issue of whether or not to defer the recomputation of a materialized query has been addressed in [6].

Detecting irrelevant and autonomously computable updates can be done based on the query model chosen. Efficient ways to do this have been described in detail in [2] and will not be duplicated here.
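As a toy illustration of irrelevance detection (a drastically simplified stand-in for the machinery of [2], restricted to constant comparisons on a single integer attribute, with all names hypothetical): an update is irrelevant to a prestored query when the two predicates are jointly unsatisfiable.

```python
import math

def bounds(comps):
    """Feasible integer interval [lo, hi] for a conjunction of constant
    comparisons on one integer attribute (a simplifying assumption)."""
    lo, hi = -math.inf, math.inf
    for op, c in comps:
        if op == "=":
            lo, hi = max(lo, c), min(hi, c)
        elif op == "<":
            hi = min(hi, c - 1)
        elif op == "<=":
            hi = min(hi, c)
        elif op == ">":
            lo = max(lo, c + 1)
        elif op == ">=":
            lo = max(lo, c)
    return lo, hi

def irrelevant(update_pred, stored_query_pred):
    """The update can never touch the stored query's result iff the two
    predicates' feasible intervals do not intersect."""
    lo1, hi1 = bounds(update_pred)
    lo2, hi2 = bounds(stored_query_pred)
    return max(lo1, lo2) > min(hi1, hi2)

# Stored query selects A > 12; an update touching only A <= 5 is irrelevant.
print(irrelevant([("<=", 5)], [(">", 12)]))   # True
print(irrelevant([(">=", 10)], [(">", 12)]))  # False: the regions overlap
```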

Overhead required to actually perform the update on the SADB: Since the SADB may contain several copies of a given tuple, this type of overhead can be controlled by controlling the number of copies of tuples subject to update operations. In this paper, considerable attention has been devoted to minimizing this overhead by modeling its effects and directly basing the optimal SADB content compaction on the overhead estimates returned by the resulting model.

3.3. Using the SADB in query evaluation

This section gives a brief discussion on how the SADB is used in query evaluation. A more detailed discussion can be found in [9].

The SADB consists of a collection of partially evaluated response sets of queries stored in memory. Given an arbitrary query, it is required that its response set be found as quickly as possible. Every incoming retrieval query is processed in the following 5 steps:

Step 1. The query expression is decomposed by the query processor into a number of subexpressions organized into a query evaluation tree.

Step 2. Each subexpression is normalized to a unique form.

Step 3. Starting from the root of the query evaluation tree, every subexpression is matched up in the SADB directory to all the page-queries which had been prestored for it (if any). The tree is traversed in a breadth-first order because any page-query that can be used in evaluating a certain subexpression can also be used in evaluating all of its descendants.

Step 4. The subexpressions of the tree are evaluated from the main database as usual with the following modification: all the pages which had been preprocessed in the SADB relative to a certain subexpression are excluded from the search to evaluate this subexpression and from the searches to evaluate all its descendant subexpressions.

Step 5. The final result of each subexpression in the tree is assembled from the results obtained from the main database search and from any available page-queries obtained from


the SADB. According to our definitions of page-queries, this assembly process should not involve additional page accesses. Only main memory operations are needed to perform the assembly process.
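For a single selection subexpression, steps 3 through 5 can be sketched as follows. This is a simplified, hypothetical rendering (pages as integer-keyed lists of values, a predicate in place of a full expression tree), not the paper's implementation.

```python
def evaluate(pred, main_db, sadb_hits):
    """Use prestored page-queries where available (step 3), scan only the
    uncovered pages of the main database (step 4), and assemble the result
    in main memory (step 5).

    main_db:   {page_id: [value, ...]}  -- the base data, page by page
    sadb_hits: {page_id: [value, ...]}  -- prestored page-query results
    """
    covered = set(sadb_hits)                       # pages to skip in the scan
    result = [v for hits in sadb_hits.values() for v in hits]
    for pid, page in main_db.items():
        if pid not in covered:                     # search only uncovered pages
            result.extend(v for v in page if pred(v))
    return sorted(result)                          # assembly: main memory only

main_db = {1: [3, 15, 20], 2: [8, 14], 3: [30, 2]}
sadb = {1: [15, 20]}           # a prestored page-query for "value > 12" on page 1
print(evaluate(lambda v: v > 12, main_db, sadb))   # [14, 15, 20, 30]
```

Note how each prestored page-query saves exactly one page scan: page 1 is never touched during evaluation.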

4. The concept of page-queries

The definition of page-queries is dependent on the type of query in question. This section gives exact definitions for the three basic types of queries: selections, projections, and joins as well as all their combinations.

4.1. Selection page-queries

A selection page-query <pi - qrj^s> is the simplest type of page-query:

Definition 1. A selection page-query <pi - qrj^s> is defined as the result of applying the selection query qrj^s to page pi. Thus:

<pi - qrj^s> = Qrj^s ∩ Pi
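Definition 1 translates directly into code; the sketch below computes Qrj^s ∩ Pi by filtering the page contents with the selection predicate (the tuple encoding is an assumption for illustration).

```python
def selection_page_query(pred, page):
    """Definition 1 as a set intersection: <p_i - q_rj^s> = Q_rj^s ∩ P_i,
    computed directly by filtering page contents P_i with the predicate."""
    return {t for t in page if pred(t)}

P1 = {("ann", 34), ("bob", 11), ("cho", 57)}       # one page of (name, age) tuples
pq = selection_page_query(lambda t: t[1] > 12, P1)  # selection: age > 12
print(sorted(pq))   # [('ann', 34), ('cho', 57)]
```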

4.2. Projection page-queries

The definition given for selection page-queries is not suitable in the case of projection queries because it does not eliminate all duplicates, as the following example shows:

Example 1. Suppose a certain database has two pages containing the integers (1 2 3) and (2 3 3). Applying a projection query qr1^P to each of the pages separately produces the two projection page-queries <p1 - qr1^P> = {1 2 3} and <p2 - qr1^P> = {2 3}.

The problem with having overlapping page-queries for the same query, as in the example above, is that it becomes difficult to estimate the space requirements of a set of such page-queries. Clearly, if both page-queries are stored, there is no need to store the duplicates twice. Since a major goal of our strategy is to separate the selection process from the compaction process, an alternate definition of projection page-queries that will not result in overlapping page-queries is given.

Definition 2. A projection page-query <pi - q^P>:Sk is defined with respect to a sequence of pages Sk = p1, p2, ..., pn by applying the projection to page i and eliminating from the result any data values existing in any page pj where j < i in the sequence. In our previous example, the two projection page-queries become <p1 - qr1^P>:(12) = {1 2 3} and <p2 - qr1^P>:(12) = { }. On the other hand, <p2 - qr1^P>:(21) = {2 3} and <p1 - qr1^P>:(21) = {1}. To reconstruct the response of a given projection query, one needs to combine all its page-queries taken with respect to the same page sequence. Some main memory processing may be necessary to eliminate duplicates between the page-queries in


the SADB and those evaluated from the main database, but no additional page accesses will be needed.
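Definition 2 can be sketched as a single pass over the page sequence, dropping any value already produced by an earlier page. The toy encoding below (pages as lists of values) reproduces Example 1; the function name is an assumption.

```python
def projection_page_queries(pages, sequence, project):
    """Definition 2 sketch: apply the projection page by page along the
    given sequence, eliminating values produced by any earlier page."""
    seen, result = set(), {}
    for pid in sequence:
        vals = {project(t) for t in pages[pid]} - seen
        result[pid] = vals
        seen |= vals
    return result

# Example 1's two pages of integers, projected on the value itself:
pages = {1: [1, 2, 3], 2: [2, 3, 3]}
print(projection_page_queries(pages, (1, 2), lambda v: v))
# {1: {1, 2, 3}, 2: set()}
print(projection_page_queries(pages, (2, 1), lambda v: v))
# {2: {2, 3}, 1: {1}}
```

The union of the page-queries taken over the same sequence reconstructs the full projection, duplicate-free, as the definition requires.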

4.3. Join page-queries

A simple join page-query involves two or more relations in its definition. Those represent the join operands.

Definition 3. A heterogeneous join page-query <(pi, *) - qrl^J> is defined as the result of applying the join query qrl^J to page pi as the first operand and to all pages of the other relation as the second operand. More generally, a heterogeneous multi-join page-query <(pi, *) - qrl^J> is defined as the result of applying the join query qrl^J to page pi and all the pages of all the relations not containing page pi.

Note that since pi is allowed to come from any relation, it is possible to have some undesirable duplication among different page-queries. This duplication arises from joining the same set of pages in different orders.

Example 2. Consider a multi-way join query qrl^J joining three relations R1, R2, and R3. Suppose that each relation Ri contains two pages, pi1 and pi2. The three page-queries <(p11, *) - qrl^J>, <(p21, *) - qrl^J>, and <(p31, *) - qrl^J> consist of the following three-page joins:

<(p11, *) - qrl^J> = p11-p21-p31, p11-p21-p32, p11-p22-p31, p11-p22-p32,

<(p21, *) - qrl^J> = p11-p21-p31, p11-p21-p32, p12-p21-p31, p12-p21-p32, and

<(p31, *) - qrl^J> = p11-p21-p31, p11-p22-p31, p12-p21-p31, p12-p22-p31.

The degree of unnecessary redundancy among these page-queries is evident. In order to avoid this redundancy, the notion of semi-join page-queries is introduced. These will also be referred to simply as join page-queries.

Definition 4. A semi-join page-query <pi - qrj^J> is defined as the result of joining one page pi from one relation, as the first operand, with all pages from all other relations as the remaining operands. Only data tuples from page pi are retained. Data from the remaining pages are replaced by pointers.

Figure 2 depicts the storage structure of join page-queries. In order to detail the internal structure of a join page-query, subscripted arrows will be used to denote a pointer to tuples from the pages denoted by the subscripts.

Thus, the structure in the dashed box in Figure 2 is denoted by p11, p11 ↔ p21, p22. Applying this definition to Example 2 produces the following page-queries:


[Figure 2 shows relation R1 with pages p11 and p12 and relation R2 with pages p21 and p22 in the main database, together with the semi-join page-query <p11 - q> for the query q = R1 JOIN R2: the page-query (the dashed box) holds the tuples from p11 plus pointers to the matching tuples from p21 and p22.]

Figure 2. A diagram depicting the semi-join page-query <p11 - qrj^J>. This is also denoted by qrj(p11), qrj(p11) ↔ qrj(p21, p22), or simply p11, p11 ↔ p21, p22.

p11, p11 ↔ p21, p22 → p31, p32;
p21, p21 ↔ p11, p12 → p31, p32;
p31, p31 ↔ p11, p12 → p21, p22.

Note the absence of overlap in the data in the page-queries. Pointers, however, are permitted to overlap for this type of join-only query.

4.4. Projection-selection-join page-queries

Definitions 1, 2, and 4 above can be combined in an arbitrary manner. Let us examine how this can be done for a set of relations R1, R2, ..., Rn and an arbitrary PSJ query q^PSJ defined on them. Assume that the predefined page sequences for the relations are S1, S2, ..., Sn. Let Sg be a global sequence of all the pages of R1, R2, ..., and Rn. The sequence Sg is used to establish the order of performing the projections of the query as required by Definition 2. A general strategy for deriving a number of page-queries is as follows:

(1) Stamp each attribute value with a page number indicating the page it originates from.

(2) Evaluate all the selections first. Group the resulting tuples according to the pages they come from. This is done easily based on the page stamps.


(3) Next evaluate all the joins.

(4) Finally, perform the projections. At this stage, the page-queries are delineated with respect to the global sequence. According to the proposed strategy, all attribute values specified by the final projection which come from the first page in the sequence, pi1, are retained. This represents the page-query <pi1 - qj^PSJ>. In general, the final projection specifies attributes from more than one relation, and the data structure used to represent semi-join page-queries is used. Next, all attribute values specified by the final projection which come from the second page in the sequence are retained, except those tuples already selected by a previous page-query. This is the page-query <pi2 - qj^PSJ>. The process continues until all pages have been scanned.

In determining the overlaps among page-queries, one can distinguish three types of overlaps: 1) data-data overlaps: these are overlaps among data values of different page-queries, 2) data-pointer overlaps: these are overlaps between data values in one page-query and a pointer in another page-query that points to the same data item, and 3) pointer-pointer overlaps: these are overlaps among the pointers (i.e. the pointers point to exactly the same location). All three types of overlaps are resolved by the above strategy. Let us see an example.

Example 3. Given two relations R1(A, B) and R2(C, D), assume that each relation occupies two memory pages and that each page has a capacity of two tuples. Let the two relations be:

R1:
        A  B
   p1   1  6
        3  8
   p2   1  7
        5  7

and

R2:
        C  D
   p3   6  2
        8  4
   p4   7  2
        8  4

Consider the query qj = (R1 × R2){(B = C)}[A, D]. The join query results in the intermediate relation:


tid   A  B  page-id   C  D  page-id
t1    1  6  1         6  2  3
t2    3  8  1         8  4  3
t3    3  8  1         8  4  4
t4    1  7  2         7  2  4
t5    5  7  2         7  2  4

If the global sequence Sg were chosen to be p1, p2, p3, p4, then the following set of page-queries is obtained before removing the overlaps and without using the semi-join structures:

<p1 - qj> = {(1, 2); (3, 4); (3, 4)}

<p2 - qj> = {(1, 2); (5, 2)}

<p3 - qj> = {(1, 2); (3, 4)}

<p4 - qj> = {(3, 4); (1, 2); (5, 2)}

Removing the overlaps results in:

<p1 - qj> = {(1, 2); (3, 4)}

<p2 - qj> = {(5, 2)}

<p3 - qj> = {(1, 2); (3, 4)}

<p4 - qj> = {(5, 2)}

The bold numbers above are the attribute values which will actually be stored as part of the page-query. All other numbers will be replaced by pointers.
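The delineation and overlap removal just described can be reproduced in code. The sketch below is an illustrative reconstruction, not the paper's implementation: the relation contents and page names come from Example 3, while the function and variable names are my own.

```python
# Reconstruction of Example 3's page-query delineation (relation contents and
# page names are from the example; function and variable names are invented).
R1 = {"p1": [(1, 6), (3, 8)], "p2": [(1, 7), (5, 7)]}  # R1(A, B) by page
R2 = {"p3": [(6, 2), (8, 4)], "p4": [(7, 2), (8, 4)]}  # R2(C, D) by page

# Steps 1-3: stamp each value with its page of origin and evaluate the join B = C.
result = []  # entries: (A, D, page of A, page of D)
for pa, r1_tuples in R1.items():
    for a, b in r1_tuples:
        for pd, r2_tuples in R2.items():
            for c, d in r2_tuples:
                if b == c:
                    result.append((a, d, pa, pd))

# Step 4: delineate page-queries along the global sequence p1, p2, p3, p4,
# removing overlaps among the pages of each relation.
def page_queries(pages, side):
    seen, out = set(), {}
    for p in pages:
        tuples = []
        for a, d, pa, pd in result:
            origin = pa if side == 0 else pd
            if origin == p and (a, d) not in seen:
                seen.add((a, d))
                tuples.append((a, d))
        out[p] = tuples
    return out

pq_r1 = page_queries(["p1", "p2"], side=0)  # {'p1': [(1, 2), (3, 4)], 'p2': [(5, 2)]}
pq_r2 = page_queries(["p3", "p4"], side=1)  # {'p3': [(1, 2), (3, 4)], 'p4': [(5, 2)]}
print(pq_r1, pq_r2)
```

Running this yields exactly the four page-queries obtained above after overlap removal.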

Finally, the page-queries are converted to the semi-join structure with pointers. The resulting page-queries are:

<p1 - qj> = {(1, → t1); (3, → t2)}

<p2 - qj> = {(5, → t5)}

<p3 - qj> = {(t1 ← 2); (t2 ← 4)}

<p4 - qj> = {(t5 ← 2)}

Each one of these page-queries can be weighted separately for selection into the SADB, as all overlaps have been eliminated. The weight of each page-query depends on its size, frequency of use, frequency of reference, and the amount of overlap with other page-queries derived from different queries. The weighting method is the subject of the coming sections; however, let us give here the formula for the size of a generalized Project-Select-Join page-query:

|<pi - q>| = ni · Σ_{k=1}^{mi} wk + p · nq · (m2 - 1)

where ni is the number of tuples in page i of relation R satisfying the query, mi is the number of attributes from relation R specified in the final projection of the query, wk is


the width of attribute k in bytes, p is the pointer length in bytes, m2 is the total number of relations participating in the join and contributing some attribute to the final projection, and nq is the total number of tuples in the final result originating from page pi of the page-query <pi - q> in question. Note that the attributes of the final join are numbered sequentially as indicated by the query model, repeated here for convenience:

(R1 × R2 × ... × Rk){C}[A1, A2, ..., Az]

Significance: Note that the above definition of generalized page-queries combines two heterogeneous optimization tools in one structure. It provides a uniform treatment for the complementary problems of choosing an optimal partial data materialization and an optimal set of partial indexes. This is evident from the structure of the generalized page-queries: they contain both a data component and a pointer component, either of which can be missing. Further, overlaps among pointers and data are taken into account when choosing the population of the SADB and when deciding on a compaction scheme. The next section discusses the weighting of page-queries and the effects of compaction on the weighting process.
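As a loose illustration of this dual structure, a generalized page-query could be modeled as a record carrying both components; the field names below are invented for the sketch, not taken from the paper.

```python
# Illustrative model of a generalized page-query: a data component plus a
# pointer component, either of which may be empty (all names are invented).
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class PageQuery:
    page: str                                          # originating page, e.g. "p1"
    query: str                                         # query id, e.g. "qj"
    data: List[Tuple] = field(default_factory=list)    # materialized attribute values
    pointers: List[str] = field(default_factory=list)  # references into other page-queries

# <p1 - qj> from Example 3: stores A-values 1 and 3 plus pointers to t1 and t2.
pq = PageQuery(page="p1", query="qj", data=[(1,), (3,)], pointers=["t1", "t2"])
print(pq.data, pq.pointers)
```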

5. Assessing the benefits of page-queries

Page-queries are designed to allow the assignment of a single weight to each page-query. This weight reflects the amount of saving in page accesses per byte stored for the page-query in question. In [9], a performance model is presented, based on which closed formulae for assigning those weights are given. The situation changes, however, when considering the effects of compaction on the weighting function. Let us discuss the interplay between the process of selecting the best page-queries and compaction.

The compaction process can be applied either before or after the selection of the contents of the SADB. If the compaction were applied first, the function used to weigh the page-queries must be modified to take into account the compaction scheme under which the selection is to be performed. This is done by simultaneously evaluating the weights of several page-queries instead of weighing them separately. The particular set of page-queries to be evaluated together depends on the compaction scheme.

It is not possible to give a closed-form solution for the weighing function Wij for each page-query applied to compacted data, for the following two reasons:

(1) The amount of space needed to store a page-query is not known until the entire selection process terminates. Note that, due to compaction, the amount of storage needed to store the contents of a certain page-query <pi - qrj> may be less than the number of tuples in page pi satisfying query qrj. As an example, let us consider the query situation in Figure 3. The figure shows two retrieval queries qr1 and qr2 and one deletion query qd1. To the left, a segment of the main database is shown. Page pi contains five tuples satisfying query qr1, four of which also satisfy qr2. Note the existence of three tuples satisfying all three queries qr1, qr2 and qd1. To estimate the storage requirements for page-query <pi - qr1>, it must be known whether page-query <pi - qr2> is stored or not. Obviously, if <pi - qr2> is not stored, then the storage requirements for


Figure 3. A query intersection case showing the interdependence of page-query weights if the data were compacted.

<pi - qr1> is four tuples. On the other hand, if <pi - qr2> were selected and stored, only one more tuple space needs to be reserved for <pi - qr1>. This is the classical chicken-and-egg problem: the (optimal) selection cannot be done without knowing which page-queries will be selected.

(2) Likewise, the amount of overhead needed to update the SADB due to insertions and deletions cannot be estimated before knowing which page-queries will be selected. To clarify this point, let us consider again the query situation of Figure 3. To estimate the cost of performing the update query on the data tuples satisfying query qr2 in page pi of the main database, it must be known whether or not the data tuples satisfying qr1 in page pi are selected. If they were, obviously no extra cost is incurred in maintaining the tuples satisfying qr2 in page pi. If they were not, then the three tuples will need to be updated every time qd1 is issued, and an estimate of this cost will need to be included in the weighing function of <pi - qr2>.

Note that the real problem is caused by the existence of some tuples in an intersection area between two (or more) retrieval regions (retrieval regions were defined in section 2.3). This modeling problem can be avoided if, instead of evaluating every page-query separately, all intersecting queries having tuples in their intersection areas which come from pi are evaluated together. For the example shown in Figure 3, this would mean that instead of assigning two separate weights to <pi - qr1> and <pi - qr2>, only one weight is assigned to <pi - (qr1, qr2)>. The result of this modification will be a slightly reduced selection granularity.
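The grouping of overlap-connected page-queries can be sketched as a connected-components computation over the overlap relation. This is a minimal illustration with invented identifiers and tuple-id sets, not the paper's procedure (which is given in section 6.4):

```python
# Cluster page-queries whose tuple sets intersect, so that each
# overlap-connected cluster can be weighted as one unit (union-find sketch).
def cluster_overlapping(page_queries):
    # page_queries: dict name -> set of tuple ids
    names = list(page_queries)
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if page_queries[a] & page_queries[b]:  # overlap -> same cluster
                parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), set()).add(n)
    return sorted(map(sorted, clusters.values()))

pqs = {"pi-qr1": {1, 2, 3, 4, 5}, "pi-qr2": {2, 3, 4, 5}, "pj-qr3": {9}}
print(cluster_overlapping(pqs))  # [['pi-qr1', 'pi-qr2'], ['pj-qr3']]
```

Here <pi - qr1> and <pi - qr2> share tuples and so receive a single combined weight, while <pj - qr3> remains separable.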

To summarize, the selection procedure applied to uncompacted data can be done by weighing all the page-queries individually and selecting the ones with the highest weights to populate the SADB. The same selection procedure cannot be applied to compacted data because the weighing function of page-queries requires knowledge of: 1) the space requirements of a page-query, without knowing how much of these requirements are shared with those of other page-queries, and 2) the update overhead caused by storing a page-query redundantly, without taking into account the benefits obtainable by sharing this overhead with other page-queries. The problem is solved by grouping the page-queries which have


common tuples together, in a way which will become clearer in section 6.4, where the method used to group the page-queries is described by means of examples and an algorithm to perform the weighting is given. The next section deals with the compaction problem.

6. Compacting the SADB

This section describes the different algorithms used to compact the SADB. These algorithms are applied periodically, after each reorganization of the SADB occurs. This process is performed during low transaction activity periods and results in better response time during high activity periods. The process of reorganizing the SADB consists of two steps: 1) selecting the parts of the data which are to be stored redundantly and 2) compacting the SADB.

The selection process is based on the anticipated usage pattern as well as on the actual data distribution at the time of the reorganization. On termination, the selection process results in determining a number of page-queries which, when stored redundantly, will optimize performance during the next time period [9].

The above selection process does not say anything about how those sets should be stored. As will be illustrated shortly, the data to be stored in the SADB will in general contain a large amount of unnecessary internal redundancy. The process of trimming this unnecessary redundancy from the SADB is called the compaction process. Care must be taken when performing the compaction process to cluster related data next to each other on the disk.

Take for example the case of two intersecting page-queries, A and B. Obviously, if this data is stored in three contiguous memory areas holding A ∪ B - B, A ∩ B, and A ∪ B - A, each of A and B can still be found in a contiguous memory area. Furthermore, an amount of storage equal to the number of tuples in the set A ∩ B would be saved. The benefits of this compaction process go beyond the obvious storage reduction illustrated by the above example. The savings also include time savings when doing updates. To illustrate this, let us assume that some of the tuples lying in the intersection region of A and B of the above example are to be deleted. If no compaction of the data has been done, the deletion operation will have to be performed twice (once for each copy). If the data were compacted, the deletion will have to be done only once.
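A minimal sketch of this three-area layout, using invented tuple identifiers:

```python
# Store two overlapping query results A and B as three contiguous areas:
# A ∪ B - B, A ∩ B, and A ∪ B - A (tuple ids here are invented).
A = {1, 3, 4, 5, 6, 7}  # tuple ids satisfying query A
B = {5, 6, 7, 8, 9}     # tuple ids satisfying query B

layout = sorted(A - B) + sorted(A & B) + sorted(B - A)  # one contiguous sequence

# Each query still occupies a contiguous slice of the layout:
assert set(layout[:len(A)]) == A
assert set(layout[len(A - B):]) == B

# Storage saved versus storing A and B separately equals |A ∩ B|:
saved = len(A) + len(B) - len(layout)
print(saved)  # -> 3, i.e. len(A & B)
```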

If more than one deletion operation is to be attempted for different tuples lying in the same area of intersection, proportionately as many data access operations will be avoided. Also, for the case of n nested page-queries, every deletion operation done in the area of the innermost page-query can be done once, as compared to n repetitions if no compaction is done. Another favorable side effect of this compaction process is allowing more tuples to fit in the SADB space (if warranted) and thus achieving more time savings during the next time period. The next section defines the problem. Sections 6.2 and 6.3 describe in detail the compaction process.


Figure 4. Typical input to the compaction problem: a) retrieval-only queries and b) retrieval, deletion and insertion queries.

6.1. General discussion and problem statement

As mentioned before, compacting the data in the SADB leads to smaller storage requirements, more time savings, and better performance in an update environment. Informally, the compaction problem can be stated as: find a storage scheme which minimizes the time needed to process a set of overlapping query results from a secondary storage device. In the example of the two overlapping queries A and B, the optimal storage scheme was obvious: store the two queries in the form of the three contiguous tuple sets A ∪ B - A, A ∩ B and A ∪ B - B. In a more complicated situation, like the case of the three overlapping queries shown in Figure 4-a, there are several feasible solutions to the problem. One of those feasible solutions is optimal.

Figure 4-b shows the general case of the input to the compaction problem. In the general case, we have a set of retrieval regions (the solid rectangles), another set of deletion queries (the dashed rectangles) and a third set of insertion queries (shown by the black dots inside the dotted rectangles) all overlapping in an arbitrary fashion. Furthermore, every query in the problem is assigned a certain expected frequency of occurrence. Assuming that the tuples falling within the retrieval regions (i.e. qualifying in the corresponding retrieval queries) are to be stored sequentially in secondary memory, find the storage scheme which will minimize the time spent in processing all the retrieval, deletion and insertion queries. Assume that the results of the retrieval queries are discarded after every retrieval. Thus, if the same retrieval query is issued twice in succession, it must be accessed twice.

The next section formulates the problem of finding an optimal compaction scheme for whole queries as a graph problem and gives an algorithm for it. The solution of this problem takes the form of a set of sets, each containing the identification of a group of retrieval regions whose tuples can be stored contiguously without any redundancy. Solving this problem involves solving a related sub-problem which is called the group membership problem. This problem is related to another problem known in the literature as the Consecutive Retrieval


Problem (CRP). The problem discussed here differs in two important ways from the previous work done for the CRP as exemplified by [3]: First, the problem presented here is geared more towards minimizing update cost than just minimizing storage. Second, the problem presented here allows nested queries. Although other work has addressed the CRP with nested queries (e.g. [4] and [5]), these papers do not consider the problem in the presence of updates. The group membership problem is defined and solved in section 6.3. Finally, section 6.4 presents a solution to the problem of compacting page-queries by extending the analyses given in the next section for compacting whole queries.

6.2. Finding an optimal compaction scheme for whole queries

This section and the following two sections present the method used for compacting the SADB.

Let us represent each retrieval region in the input to the compaction problem by a node in an undirected graph called the query intersection graph, and the intersection of two queries by an edge connecting the two corresponding nodes. A compaction scheme is a grouping of the retrieval regions into a number of groups G1, G2, ..., Gn such that the following two conditions are satisfied:

(1) The tuples satisfying each query can be stored contiguously in memory (i.e. tuples qualifying in one query are clustered together) and

(2) Within each group, no duplicate tuples can be found.

Thus, a compaction scheme is a partitioning of the query intersection graph into a number of disjoint subgraphs, each forming one group. Note that in general there are many possible compaction schemes for a given set of intersecting queries. Figure 5-b shows the four possible compaction schemes for the three intersecting retrieval regions qr1, qr2 and qr3 shown in Figure 5-a. Figure 5-c shows the storage scheme for the ten tuples in the ADB for each of the possible compaction schemes. Note that the above two conditions defining a group are satisfied for every possible scheme. One of those schemes is optimal in the sense that it minimizes the time needed to process all the expected queries (including retrieval, deletion and insertion queries) during the next time period. In the following sections, a greedy algorithm is presented to find a near-optimum partitioning.

To obtain one partitioning of the query intersection graph, the graph is traversed in a depth-first manner starting from the node with the highest potential for producing good page-queries. This potential is assessed based on a direct modeling of the benefit of storing the entire query in the SADB (i.e. all the page-queries emanating from this query). The formula is given below:

B(qrj) = [ frj · C(qrj) - Σ_{k=1}^{nd} fdk · (|Qrj ∩ Qdk| · B + Cdk(qrj)) - Σ_{l=1}^{ni} fil · (|Qrj ∩ Qil| · B + Cil(qrj)) ] / |Qrj|


Figure 5. The many possibilities for compaction schemes: a) three intersecting retrieval regions, b) their four possible compaction schemes, and c) the storage scheme of their tuples in secondary memory. The four schemes are:

Scheme 1:  G1 = 1, 3, 4, 5, 6, 7   G2 = 2, 3, 4, 5
Scheme 2:  G1 = 1, 2, 3, 4, 5, 6   G2 = 4, 5, 6, 7
Scheme 3:  G1 = 2, 3, 4, 5, 6      G2 = 4, 5, 6, 7
Scheme 4:  G1 = 1, 3, 4, 6         G2 = 2, 3, 4, 5   G3 = 4, 5, 6, 7

Intuitively, this formula assesses the benefit per byte stored of prestoring the entire query qrj in the SADB. The first term represents the retrieval benefits over the normal cost of recomputing the query, the second term represents the update overhead with respect to deletion operations, while the third term represents the update overhead due to insertion operations. In [3], the query with the most tuples lying in the intersection with some other query is chosen to start with. Here, a measure based on the combined effects of future retrievals, deletions and insertions is employed. This formula is only used to select the first retrieval region to be tested in a cluster. Subsequent retrieval regions are added to the current group by traversing the graph in depth-first order; at each node, an adjacent node is selected if it satisfies the membership criterion for the current group and simultaneously maximizes the total benefit of the group. The membership criterion is tested every time a node is visited to determine whether or not this node can be added to the


group of nodes in the current group. A node passes the test if it can be added to the current group without violating the two conditions: contiguity of the tuples within each query and the absence of duplicate tuples within a group. This test is called the group membership problem and is described in the next section. As an example of a node which cannot be added to a group, the query qr2 of Figure 4-b (the first case) cannot be added to the group G1 without violating the contiguity constraint. Every time a violating node is encountered, it is marked as "rejected" and the search continues until the entire graph has been traversed. All the nodes of the graph not marked as "rejected" form the first group and are removed from the graph. The nodes of the remaining graph are then reset to "accepted" and a similar search is performed until the second group is found. The process is repeated until the entire graph has been partitioned into a number of groups G1 to Gn. The combined benefit for group Gi is given by:

B(Gi) = [ Σ_{qrj ∈ Gi} frj · C(qrj) - Σ_{k=1}^{nd} fdk · (|Gi ∩ Qdk| · B + Cdk(Gi)) - Σ_{l=1}^{ni} fil · (|Gi ∩ Qil| · B + Cil(Gi)) ] / |Gi|

The major idea behind using this formula is that it is better to place all the good retrieval regions together in the same group as much as possible. Since the SADB space is limited, it is always better to arrange the groups such that a more skewed distribution of good vs. bad page-queries is obtained. Finally, note that the measure used in [3] to select the next query (the amount of overlap with the current group) is also taken into account here. This quantity directly affects the denominator of the above formula in the following way: suppose that two queries are identical in every respect except that one overlaps more with the current group than the other. This query will result in a smaller denominator term and thus a larger value of the expected benefits, so it will be preferred over the one which overlaps less with the current group. A recursive algorithm to perform the graph partitioning is given below:

1.  algorithm groups;
2.  {this algorithm partitions the query intersection graph G = (V, E)
3.   to produce query groups satisfying the consecutive retrieval property}
4.  procedure partition (v: node);
5.  begin
6.    min-query-id := nil;
7.    min-weight := infinity;
8.    for each node w adjacent to v do
9.    begin
10.     if not accepted[w] and acceptable[w] then
11.       if group-member(G[i], w) then
12.       {note: splitting the above two tests avoids unnecessary
           invocations of group-weight below}
13.       begin
14.         if group-weight(G[i], w) < min-weight then
15.         begin
16.           min-weight := group-weight(G[i], w);
17.           min-query-id := w.id
18.         end;
19.       end
20.       else acceptable[w] := false;
21.     {note: acceptable is never set to true in the current group once it becomes false}
22.     if min-query-id ≠ nil then
23.     begin
24.       accepted[min-query-id] := true;
25.       G[i] := G[i] ∪ {min-query-id};  {add query to current group}
26.       remove node min-query-id and all its attached arcs from the graph;
27.       {note: a query is never a member of two groups}
28.       partition(min-query-id);  {recursive call}
29.     end
30.   end;
31. end;
32. begin {main body of groups}
33.   i := 1;  {initialize group index. This is a global variable}
34.   while graph not empty do
35.   begin
36.     acceptable := true;  {reset acceptable array. No need to reset accepted array}
37.     select a node v with the largest initial benefit;
38.     partition(v);
39.     i := i + 1
40.   end;
41.   numgroups := i;  {numgroups is the total number of groups produced}
42. end.
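The algorithm above can also be rendered in runnable form. The sketch below is a simplified interpretation, not the paper's code: the group-member and group-weight tests are pluggable stand-ins here, and the tie-breaking among equal-weight neighbours is arbitrary.

```python
# A simplified runnable rendition of algorithm "groups" (an interpretation,
# with stand-in group-member and group-weight tests).
from math import inf

def partition_graph(adj, benefit, group_member, group_weight):
    remaining = set(adj)
    groups = []

    def grow(group, v):
        # Greedy step: among admissible unassigned neighbours of v, add the
        # one minimising the group weight, then continue from it (this plays
        # the role of the recursive call in the pseudocode).
        while True:
            best, best_w = None, inf
            for w in adj[v]:
                if w in remaining and group_member(group, w):
                    wgt = group_weight(group, w)
                    if wgt < best_w:
                        best, best_w = w, wgt
            if best is None:
                return
            group.add(best)
            remaining.discard(best)
            v = best

    while remaining:
        start = max(remaining, key=benefit)  # node with the largest initial benefit
        remaining.discard(start)
        group = {start}
        grow(group, start)
        groups.append(group)
    return groups

# Tiny demo: a triangle of queries, with membership capped at two per group.
adj = {"q1": ["q2", "q3"], "q2": ["q1", "q3"], "q3": ["q1", "q2"]}
groups = partition_graph(
    adj,
    benefit=lambda n: {"q1": 3, "q2": 2, "q3": 1}[n],
    group_member=lambda g, w: len(g) < 2,  # stand-in membership test
    group_weight=lambda g, w: len(g) + 1,  # stand-in weight
)
print(groups)  # two groups: {'q1', 'q2'} and {'q3'}
```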

Page 21: Page-query compaction of secondary memory auxiliary databases

PAGE-QUERY COMPACTION 391

Worst-case complexity analysis: The algorithm contains a recursive depth-first search (the recursive call on line 28). If the query intersection graph is represented by its adjacency lists, then the time to complete one traversal is O(e), where e is the number of edges. If the graph is represented by its adjacency matrix, then the time for one traversal is O(n^2), where n is the number of nodes. Since the query intersection graph will most likely be sparse, it would normally be better to implement the graph by its adjacency lists. At each node, two tests are performed: 1) the group membership test (line 11) and 2) a weight assessment for the entire group including the node being tested (line 14). Note that the weight test included on line 16 is only included for clarity and can be substituted by an assignment to a variable in line 14. In section 6.3, it is shown that the complexity of the group membership test group-member(G[i], w) is O(|G[i]| log |G[i]|). Assessing the weight requires computing a formula based on collected statistics only. Thus, in the worst case, a group membership test will need to be performed every time an edge is considered. The total time required is then computed as O(Σ_{i=1}^{e} |G[i]| log |G[i]|).

The next section gives an algorithm which determines whether or not a given query can be included in some specified group. The same algorithm produces the specific order in which the different parts of the resulting group are placed.

6.3. The group membership problem

A group of retrieval regions is defined as follows: all the tuples satisfying the queries of a group can be stored in memory such that 1) the tuples belonging to each query can be found in a contiguous area of memory and 2) no tuple in the group is stored more than once. The problem addressed in this section can be stated in the following way: given a certain group Gi and a new query qrj, determine whether or not the queries of Gi and qrj form a group. If the answer is yes, the algorithm will also yield a description of how to store the resultant group such that the contiguity and nonredundancy conditions are met.

When two or more queries intersect, a number of smaller regions are formed by the intersection. Those regions can be defined and described by other query predicates and will be referred to as query elements or query parts. Figure 6 shows four intersecting queries qr1 - qr4. First, a table called the query decomposition table, having one entry for each query, is constructed. Each entry in the table contains an identification of its query and all the parts it is composed of. Table 1 is the query decomposition table for the four queries of Figure 6.

The process is initiated by considering any one of the queries to be a group containing one query. This group is represented by a list. In the following examples, every list will be enclosed within two subscripted parentheses. The subscripts help to match the parentheses enclosing the same list. For example, if query qr1 of Figure 6 were picked to be the first group, then its representation would be:

(₁ 1 4 )₁

The elements of the next query to be considered are then concatenated to the first list. If qr2 were our second query, the result would then be:


Figure 6. Four intersecting retrieval regions and their composing parts.

Table 1. The composing parts of queries qr1 - qr4.

Query   Query Elements
qr1     1 4
qr2     1 2 4 5
qr3     1 2 3 4 5 6
qr4     4 5 6 7

(₁ 1 4 )₁ 1 2 4 5

Next, all elements among the ones just added which also appear in the group list are removed. In our example, the result will be:

(₁ 1 4 )₁ 2 5

Finally, all the elements just deleted from the added list are located in the older list and placed as close as possible to the added list. If this leads to having all the elements of the new query placed in an unbroken sequence, then the query can be successfully added to the group. A new list consisting of all the elements of the new query is formed by enclosing those elements within a new pair of parentheses. Note that in this method, each element is enclosed within a number of pairs of parentheses. For our example, the result of this step would be:

(₂ (₁ 1 4 )₁ 2 5 )₂

The last step involves moving some elements of the group list as close as possible to the just added elements. This is accomplished by shuffling all the elements in the group as well


as those just added, such that elements cannot cross the boundaries of any parentheses pairs to which they belong. In our example, applying those steps to queries qr3 and qr4 leads to the following sequence:

(₂ (₁ 1 4 )₁ 2 5 )₂ 1 2 3 4 5 6
(₂ (₁ 1 4 )₁ 2 5 )₂ 3 6
(₃ (₂ (₁ 1 4 )₁ 2 5 )₂ 3 6 )₃

and now query qr4 is added:

(₃ (₂ (₁ 1 4 )₁ 2 5 )₂ 3 6 )₃ 4 5 6 7
(₃ (₂ (₁ 1 4 )₁ 2 5 )₂ 3 6 )₃ 7
(₃ 3 (₂ 2 (₁ 1 4 )₁ 5 )₂ 6 )₃ 7
(₃ 3 (₂ 2 (₁ 1 (₄ 4 )₁ 5 )₂ 6 )₃ 7 )₄
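The question this procedure answers (can the elements be ordered so that every query's elements form an unbroken run, i.e. the consecutive retrieval property holds?) can be checked by brute force for tiny inputs. A sketch using the element sets of Table 1:

```python
# Brute-force group-membership check: try every ordering of the elements and
# accept if each query's elements occupy consecutive positions.
from itertools import permutations

def forms_group(queries):
    elements = sorted(set().union(*queries))
    for order in permutations(elements):
        pos = {e: i for i, e in enumerate(order)}
        if all(max(pos[e] for e in q) - min(pos[e] for e in q) == len(q) - 1
               for q in queries):
            return True
    return False

# The four queries of Figure 6 / Table 1:
q1, q2, q3, q4 = {1, 4}, {1, 2, 4, 5}, {1, 2, 3, 4, 5, 6}, {4, 5, 6, 7}
print(forms_group([q1, q2, q3, q4]))  # True: e.g. the order 3 2 1 4 5 6 7
```

Note that the witness order 3 2 1 4 5 6 7 matches the final sequence derived above; brute force is only workable for a handful of elements, which is why the paper gives the linear-scan algorithm that follows.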

The following is a formal presentation of the algorithm used to determine group membership. The algorithm makes use of the following distinction between a contiguous block and a non-contiguous block.

A contiguous block is defined as the area between any two successive parentheses, regardless of the type of parenthesis (i.e. right or left) or their subscripts. The only exception to this definition is the following definition of non-contiguous blocks.

A non-contiguous block consists of two portions: a right and a left portion. Two areas are determined to be part of one non-contiguous block (as opposed to being two separate contiguous blocks) if and only if the two areas satisfy the following conditions:

(1) The left area is immediately to the right of a left parenthesis with subscript x,

(2) The right area is immediately to the left of a right parenthesis with the same subscript x, and

(3) The parentheses inside form exactly matching pairs. Matching pairs are defined as a left and a right parenthesis with the same subscript but not necessarily occurring in succession.

The group-membership condition is tested by the following algorithm.

1.  algorithm group-member(G, q): boolean;
2.  missing := Q ∩ G;
3.  Q := Q - G;
4.  {remove from Q all elements which also exist in G}
5.  {the elements which have been removed are called the missing elements}
6.  begin
7.    i := 1;
8.    while missing ≠ ∅ do
9.      if ei ∈ missing then  {ei is the ith element in G from right to left}
10.       begin
11.         i := i + 1;
12.         missing := missing - ei;
13.       end;
14.     else begin
15.       successful := false;  {initialize missing query element detection flag}
16.       for each element ej in the current block do
17.       begin
18.         move ej to just before ei;
19.         successful := true;  {one more element found}
20.         missing := missing - ej;
21.       end
22.       if missing = ∅ then
23.       begin
24.         group-member := true;
25.         stop  {successfully}
26.       end
27.       else if current block is contiguous then
28.       begin
29.         group-member := false;
30.         stop  {unsuccessfully}
31.       end
32.       else  {this means that current block is non-contiguous}
33.       begin
34.         move all elements in the current block not in missing
35.           from the right portion to the left portion
36.       end;
37.     end;
38. end.

Intuitively, this algorithm operates in the following way. You place the elements of the new query to the right of the sequence representing the group and eliminate from them all duplicates with the group (lines 2 & 3). Next, you scan the elements of the group from right to left. You stop successfully any time the entire query can be assembled in a contiguous area from right to left. Every time you find an element that is not a member of the new query (i.e. not one of the elements just deleted in the last step), you remember its position and you begin scanning forward in the current block only (when finishing the right portion in a non-contiguous block, continue in the left portion). One of two situations will be encountered: either an element of the query will be encountered in the current block or not. If one is found, it is moved from its current position to just before the last


element position. If no other missing query elements can be found within the current block, then one of two possibilities is encountered. If the current block is a contiguous one, the algorithm stops, unsuccessfully. If, on the other hand, the current block is non-contiguous, then all non-query elements are moved from the right portion of the block to its left portion. Scanning then continues from right to left until the algorithm terminates either successfully or unsuccessfully as outlined above.

Worst-case complexity analysis: The above algorithm has two separate steps: 1) elimination of duplicates and 2) group rearrangement. Elimination of duplicates (line 3) can be done efficiently by sorting both query and group elements. A parallel scan can then be used to eliminate the duplicates. This can be done in time O(|Q| log |Q| + |G| log |G| + |Q| + |G|), which can be reduced to O(|G| log |G|).

The second step involves rearrangement of the group elements to answer the question of group membership and to simultaneously obtain a consecutively retrievable group, if possible. This is done in lines 6-38. By examining the algorithm, it is clear that no element needs to be visited more than once. Thus, the running time for this step is O(|G|) and the total running time is O(|G|log|G| + |G|). This can be reduced to O(|G|log|G|).

Once the data in the SADB has been selected and compacted, the reorganization period ends and the system enters what has been referred to as the next time period. During this period, use is made of the preprocessing done to reduce the processing cost of queries. The next section addresses the problem of compacting the SADB at the level of page-queries instead of whole queries.

6.3.1. Finding an optimal compaction scheme for page-queries

This section generalizes the results of the previous two sections when applied to page-queries instead of whole queries. The main idea is to first compact a set of whole queries and then weight the page-queries in groups based on the whole-query compaction scheme. Thus, instead of weighting individual page-queries, page-query groups, which we also call page-groups, are weighted. This is illustrated by the diagram in Figure 7.

In the example shown in Figure 7, the four queries, q1, q2, q3, and q4, are compacted into two groups: G1 and G2. Group G1 consists of only one query and group G2 consists of three queries. This is one compaction scheme. Let us follow this compaction scheme to the end until the optimal set of page-groups is found. Each query result is partitioned along page-query boundaries. These page-queries are shown in the figure. This process results in the formation of page-groups <Pi - Gj>, which consist of the data items from page Pi which participate in the result of any of the queries in the group Gj.

The idea of grouping overlapping page-queries into page-groups is to allow each page-group to be assigned a weight separately. Sometimes, however, there are page-queries which do not overlap any other page-queries, not even those within the same group. These page-queries are called separable page-queries because they can be weighted separately. In the figure, all page-queries of group G1 as well as the page-queries <P1 - q1>, <P1 - q2>, <P1 - q3> from group G2 are examples of such page-queries. In addition to the separable page-queries and the page-groups, there may be a number of page-queries which are not separable from each other but are separable from some other page-queries pertaining to the
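Whether the page-queries of one page form separable page-queries, separable page-subgroups, or a full page-group can be decided by taking connected components of the overlap relation among them. A minimal union-find sketch (the function and input names are illustrative, not from the paper):

```python
def partition_page(page_queries, overlaps):
    """Partition the page-queries of one page (within one group) into
    connected components under the `overlaps` relation. Singleton
    components are separable page-queries; larger components are
    page-subgroups, up to the whole page-group."""
    parent = {q: q for q in page_queries}

    def find(q):
        # follow parent pointers with path halving
        while parent[q] != q:
            parent[q] = parent[parent[q]]
            q = parent[q]
        return q

    for a, b in overlaps:
        parent[find(a)] = find(b)  # union the two components

    comps = {}
    for q in page_queries:
        comps.setdefault(find(q), []).append(q)
    return sorted(sorted(c) for c in comps.values())
```

For example, if q1 and q2 overlap on a page but q3 overlaps neither, the result is one page-subgroup {q1, q2} and one separable page-query {q3}.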



Figure 7. Compacting a group of page-queries. The process is done in steps: 1) an initial compaction is done based on whole queries, 2) weights are assigned to page-query groups, 3) the best page-queries and page-groups are selected to populate the SADB, and 4) another compaction scheme is tried and steps 1-3 are repeated until the optimum is reached. The diagram illustrates an example consisting of 4 overlapping queries (a). After the compaction is done based on whole queries, two query groups are formed (b). The groups are then sliced along page-query boundaries and the resulting slices are further broken up into the smallest non-overlapping segments, which are then numbered arbitrarily in (b). (c) shows the initial sequence of these segments as produced by the whole-query compaction before individual page-queries, page-subgroups, and page-groups are weighted (c, left) and the contents of the SADB after the selection terminates (c, right).

same page. For example, in the figure, one can see that <P3 - (q1, q2)> is separable from <P3 - q3>. These separable entities are called separable page-subgroups. In the figure, there are two separable page-subgroups: <P3 - (q1, q2)> and <P2 - (q2, q3)>. Thus the elements which enter the final competition in the SADB consist of separable page-queries, separable page-subgroups, and page-groups. In our example, these are:


the separable page-queries:

<P1 - q1>, <P1 - q2>, <P1 - q3>, <P1 - q4>, <P2 - q1>, <P2 - q4>, <P3 - q3>, <P3 - q4>, <P4 - q4>, and

the separable page-subgroups:

<P2 - (q2, q3)>, <P3 - (q1, q2)>, and

the page-group:

<P4 - G2>.

Thus, each separable page-query, separable page-subgroup, or page-group is assigned a weight separately, and the best set of page-queries, page-subgroups, and/or page-groups is then chosen to populate the SADB. For the example shown in the figure, those winning sets are:

the separable page-queries:

<P1 - q1>, <P3 - q4>, <P4 - q4>,

the separable page-subgroup:

<P2 - (q2, q3)>, and

the page-group:

<P4 - G2>.
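The winning sets above determine which segments of the whole-query compaction string survive; a minimal filter sketch (the segment ids and the sets are illustrative, not taken from the figure):

```python
def prune(compaction_string, winning_sets):
    """Keep only the segments of the whole-query compaction string that
    participate in at least one winning set; the rest are struck out."""
    keep = set().union(*winning_sets)
    return [seg for seg in compaction_string if seg in keep]
```

Because the surviving segments keep their relative order from the whole-query compaction, the components of the same query remain clustered together on disk.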

Starting from the compaction string of the whole queries, every query segment (the elements of the string) that does not participate in one of the winning sets above is struck out. This will lead to the final organization of the SADB components on the disk, which is indicated in the lower right part of the figure. It can be seen clearly that all the components of the same query are kept together. Algorithm assign assigns weights to all separable components according to the strategy outlined above. The algorithm assumes the existence of the data structure shown in Figure 8.

algorithm assign
{assigns weights to all information packets in the SADB. An information
packet is called an element and consists of either a separable page-query,
a separable page-subgroup, or a page-group.}

function reach(eno: element_id; i: page_id; j: query_id): boolean;
{this function returns true if the page-query <Pi - qrj> is
reachable (i.e., not separable) from element eno.}

procedure append(eno: element_id; i: page_id; j: query_id);


[groups array (G1, G2, ..., Gn) pointing into the elements array, whose entries hold sizes and weights (S1, W1), (S2, W2), ..., (Sm, Wm)]

Figure 8. The data structure used by algorithm assign.

{appends page-query <Pi - qrj> to the element eno.}

function size(eno: element_id): integer;
{this function computes the size of the union of all page-queries in element eno}

function weight(eno: element_id): real;
{this function computes the weight of an element based
on the model described in the paper}

begin {main body of algorithm assign}
  eno := 1;
  for all groups gi do {gi is assigned successive values of group ids}
  begin
    groups[gi] := eno;
    for all retrieval regions qi do
      {qi is assigned successive values of query ids}
      for all database pages pi do
        {pi is assigned successive values of page ids}
        begin
          homed := false;
          for all previously stored elements enoi do
            {enoi is assigned successive values of element numbers
            from 1 up to (eno - 1)}



            if reach(enoi, pi, qi) then {test for separability}
            begin {append page-query to an existing element}
              append(enoi, pi, qi); {assign page-query to an existing element}
              homed := true {an indication that the page-query
              has been assigned to an element}
            end; {appending page-query to an existing element}
          if homed = false then
          begin {create a new element}
            append(eno, pi, qi);
            {assign page-query to a newly created element}
            eno := eno + 1
            {increment current element pointer}
          end {creating a new element}
        end {looping for all database page-queries for one group}
  end; {looping for all groups}

  for all enoi do {now assign sizes and weights to every element enoi}
  begin {assign sizes and weights}
    elements[enoi].size := size(enoi);
    elements[enoi].weight := weight(enoi)
  end {assigning sizes and weights}
end. {assign}
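For concreteness, the element-forming loop of algorithm assign can be transcribed into Python. The reach, size, and weight routines are caller-supplied stubs here (their real definitions depend on the page-query model of the paper), so this is a sketch of the control flow only:

```python
def assign(groups, queries_of, pages, reach, size_fn, weight_fn):
    """Sketch of algorithm assign: partition page-queries into elements
    (separable page-queries, page-subgroups, or page-groups), then give
    each element a size and a weight. `reach(elem, p, q)` tests whether
    page-query <p - q> is reachable (not separable) from elem."""
    elements = []  # each element is a list of (page, query) pairs
    for g in groups:
        for q in queries_of[g]:
            for p in pages:
                homed = False
                for elem in elements:          # previously stored elements
                    if reach(elem, p, q):      # test for separability
                        elem.append((p, q))    # attach to existing element
                        homed = True
                        break
                if not homed:
                    elements.append([(p, q)])  # create a new element
    # assign sizes and weights to every element
    return [(elem, size_fn(elem), weight_fn(elem)) for elem in elements]
```

With a reach stub that links page-queries sharing a page, two queries touching the same page collapse into a single element, mirroring the page-subgroup case above.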

The next section describes the effect of the compaction process on the method used to assign weights to individual page-queries.

7. Simulation Results

In order to assess the potential benefits of the concepts and ideas presented in this paper, it is necessary to provide some measure of the expected benefits. A number of statistically oriented experiments, which simulate a large database queried by a predetermined collection of queries, have been conducted in this respect. The queries represent the anticipated retrieval regions. The main focus of the simulation was to study the amount of potential benefits in terms of page accesses saved as a result of using page-queries. Results of two experimental runs, representative of the results obtained, are presented next.

Experiment 1: Evaluating the number of page accesses that can be saved as a result of the SADB.

Experimental parameters:

• database size: 100,000 tuples

• page capacity: 100 tuples

• number of pages: 100,000/100 = 1000

• number of queries: 2

• query selectivity: 5%

• additional considerations: the effects of page-query overlaps.

• description of experiment: two queries with 5% selectivities were randomly generated. Each page-query was evaluated and assigned a weight. The SADB was populated gradually by page-queries, starting with those with the highest weights and ending with those with the lowest weights. The number of page-queries stored at any given moment represents the pages which can be eliminated from the search. Two different strategies for selecting the page-queries were used: 1) process all the page-queries of one query first and then consider all the page-queries of the other (the serial case), and 2) select the best page-queries from either query simultaneously (the parallel case). In both cases, the simulation was run twice, once without considering the effects of overlaps, and once considering the additional benefits obtainable from the compaction process.

• results presented as: four curves shown in Figure 9. The solid curve with a break in the middle represents the number of pages saved during query evaluation when populating the SADB serially and without considering the benefits of compaction. The x-axis represents the size of the SADB in tuples. The continuous solid line represents the parallel case without compaction. The dashed lines are identical except that they do consider the benefits of compaction.

• analysis of data and conclusions:

(1) The highest payoffs are clearly obtained from the parallel case with compaction.

(2) There is no point along the spectrum of SADB sizes where the serial case provides better performance than the parallel case. Both cases provide the same savings at only two points: 1) a completely empty SADB and 2) a completely full SADB.

(3) The degree of benefit obtainable by compaction is more pronounced for higher SADB sizes. This is expected as more overlaps occur. One can also observe that higher compaction benefits are obtainable, for the same SADB size, in the parallel case than in the serial case. Since it is desirable to operate the SADB at low capacity, when it provides the highest benefits, this last property is helpful.
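The serial and parallel population strategies compared above can be sketched as follows; each page-query is represented as an illustrative (weight, name) pair, and higher weights are selected first:

```python
def populate_serial(queries):
    """queries: list of lists of (weight, page_query) pairs. Process all
    page-queries of one query (best first) before moving to the next."""
    order = []
    for pq_list in queries:
        order.extend(sorted(pq_list, reverse=True))
    return order

def populate_parallel(queries):
    """Merge the page-queries of all queries and pick globally by
    descending weight."""
    merged = [pq for pq_list in queries for pq in pq_list]
    return sorted(merged, reverse=True)
```

Truncating either ordering at the SADB capacity gives the stored page-queries; the parallel ordering dominates at every prefix length except the empty and full extremes, as observed in conclusion (2).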



Figure 9. Number of pages saved versus SADB size for two queries with 5% selectivity, each. The benefits of compaction are indicated in the dashed curves. The double curves on the right (both solid and dashed) represent the serial population of the SADB by preprocessing queries one after the other. The two curves on the left represent the parallel case where page-queries of both queries are mixed together and the SADB populated from the best page-queries in the mix.

Experiment 2: Evaluating the number of page accesses that can be saved as a result of the SADB.

Experimental parameters:

• database size: 100,000 tuples

• page capacity: 100 tuples

• number of pages: 100,000/100 = 1000

• number of queries: 10

• query selectivity: 5%

• additional considerations: the effects of page-query overlaps.

• description of experiment: same method as Experiment 1 but using 10 queries instead of 2.

• results presented as: four curves shown in Figure 10. The axes are the same as in Experiment 1 above.

• analysis of data and comparison with the 2-query case:


[Legend: serial w/o compaction, parallel w/o compaction, serial with compaction, parallel with compaction]

Figure 10. Number of pages saved versus SADB size for 10 queries with 5% selectivity, each.

(1) Comparing this figure with Figure 9, one can observe that the mere existence of more queries in the system results in higher savings in page accesses for the same SADB size. This is especially true for the compacted parallel case. For example, at SADB size 5000, over 2800 page accesses can be saved in a parallel compacted SADB as opposed to only about 1400 page accesses if only 2 queries were available. The reason for this is the increase in the size of the page-query pool from which to select the SADB contents.

(2) The same observation as (1) above also applies to the parallel uncompacted case. For example, at SADB size 5000, the 10-query case allows saving 2200 page accesses as opposed to only 1300 in the 2-query case.

(3) In the serial case, both compacted and uncompacted, the number of queries involved in the selection process does not affect the SADB performance.

8. Conclusions

It is possible to divide the database store into small data packets (the page-queries), each corresponding roughly to the result of evaluating one query over one memory page. Using those page-queries as the smallest units of information retention in an auxiliary secondary-memory database (the SADB) results in better space utilization and fewer page accesses. The amount of unnecessary redundancy found among independently selected page-queries can significantly affect performance. Compacting the contents of the SADB can result in lower storage requirements, more savings through better utilization of space, and better update performance through better clustering and value sharing among the redundant data. The compaction process can be gracefully integrated with the selection process by applying the weights to collections of page-queries. The compaction itself can be performed efficiently; algorithms have been developed to find good compaction schemes in the presence of nested queries and while taking update performance into account.



Appendix

The following table summarizes the quantities used throughout the paper. Page-queries are defined in section 4.

Table A.1. General notation.

qri, qdi, qii                          Retrieval, deletion, or insertion query number i.
qri^S, qri^P, qri^J, qri^PSJ           Selection, Projection, Join, or PSJ retrieval query number i;
                                       the superscript PSJ may be dropped (default). Similar
                                       definitions apply for deletion and insertion queries.
<Pi - qrj^S>                           Selection page-query for page Pi and query qrj^S.
Sk                                     A predetermined sequence of pages.
<Pi - qrj^P> : Sk                      Projection page-query for page Pi and query qrj^P with
                                       respect to the page sequence Sk.
<(Pi, *) - qrj^J>                      Heterogeneous join page-query for page Pi and query qrj^J.
<Pi - qrj^J>                           Semi-join page-query for page Pi and query qrj^J.
<Pi - qrj> : Sk                        PSJ page-query for page Pi and query qrj with respect to
                                       the page sequence Sk.
<Pi - (qrj1, qrj2, ..., qrjn)> : Sk    PSJ page-queries for page Pi and queries qrj1, qrj2, ...,
                                       qrjn with respect to the page sequence Sk.
<(Pi1, Pi2, ..., Pin) - qrj> : Sk      PSJ page-query for pages Pi1, Pi2, ..., Pin and query qrj
                                       with respect to the page sequence Sk.
<Pi - Gj>                              Page-group for page Pi and query group Gj.
nr, nd, ni                             The number of retrieval, deletion, or insertion queries.
Qri                                    The response set of query qri.
|Qri|                                  Size in bytes of Qri.
fk^r, fk^d, fk^i                       Frequencies of reference of retrieval, deletion, or insertion
                                       query number k; may equal zero if query number k is not
                                       referenced.
B                                      Capacity of one page of secondary storage in bytes.
Pj                                     Page number j in the main database.
                                       The set of tuples in page number j in the main database.
m                                      The number of main database pages.
                                       The size of storage allocated to the MADB in pages.
Ms                                     The size of storage allocated to the SADB in pages.
tac                                    Average time needed to access one physical page.
                                       Cost of one time unit of processor time.
C(qrj)                                 The cost of computing qrj, measured in page accesses.
Cdk(qrj)                               The cost of updating Qrj in response to qdk, measured in
                                       page accesses.
Cil(qrj)                               The cost of updating Qrj in response to qil, measured in
                                       page accesses.
Cdk(Pi, qrj)                           The cost of updating <Pi - qrj> in response to qdk, measured
                                       in page accesses.
Cil(Pi, qrj)                           The cost of updating <Pi - qrj> in response to qil, measured
                                       in page accesses.
Gi                                     Query group number i.