9
Inform. Systems Vol. IO, No. I, pp. 27-35, 1985 Printed in the U.S.A. 0306-4379185 $3.00 + .M) 0 1985 Pergamon Press Ltd. DESIGN OF A VIRTUAL DATABASE SHELDON SHEN Academic Faculty of Accounting, The Ohio State University, 1775 College Road, Columbus, OH 43210 (Received 25 August 1983) Abstract-This paper proposes a database management system that automatically executes and in- tegrates data retrieval, information deduction and information processing for query processing. The major thesis is that a user should be allowed to define and access any information related to a data- base-even if the information is not explicitly stored in the database; furthermore, this process of generating new information should be automated in a sense that users do not have to write procedural programs. Thus, the notion of data independence and query processing can be maintained. 1. INTRODUCTION The evolution of database technology has been in- fluenced and characterized by two important con- cepts: data independence and query processing. Data independence requires no modifications in an application program when a database is restruc- tured; thus, a user is relieved from the details of data representation and, consequently, is able to concentrate on the information content of the data. This concept has traditionally been represented by a database abstraction, as in Fig. 1, where two lev- els of data independence-physical data indepen- dence and logical data independence-are demon- strated. Query processing, on the other hand, elim- inates procedural programs and enables data re- trieval to be automated; the idea, of course, is to free a user from the details of programming. Recently, attention has been focused on adding inferential or deductive capabilities to databases. These capabilities permit information that is not stored explicitly in a database to be implied. In these systems, the conceptual database contains implied data that do not have underlying physical data to instantiate. This idea is conceptually rep- resented in Fig. 2, where the deductive database, consisting of the actual logical data and the addi- tional implied data, becomes the new conceptual database. To users, no distinction is made between the actual data and the implied data, for the query language can be directly applied to the implied database-even if the implied data do not exist. Admittedly, the deductive database greatly en- hances the database capabilities. More importantly, data independence and query processing can all be preserved. Yet the deductive database, significant as it is, is not a panacea for all the database appli- cations. Most meaningful information, as discussed below, results from data processing; simple data de- duction sometimes does not provide enough insight for data interpretation. As an example, a hypothesis tested as true or false is certainly more meaningful than merely a set of data. The former, however, must come from a statistical testing procedure to analyze and process the latter. Since no hypothesis can be simply de- duced or implied by any set of data, no query can be directly used to establish the hypothesis in any deductive database. As another example, “John got aB in CS501” provides more information than “John had 78.5 in the midterm and 63.5 in the final”. Again, the former, computed from a grade calculation pro- cedure processing all the scores in CS501, cannot be deduced by the latter. Thus, no simple query system exists in current database systems to per- form such information processing. Solutions for the above example, of course, exist. A user, striving for more information, is told to imbed a query language in other programming lan- guages to perform data processing. Obviously, it is difftcult and awkward to combine two entirely dif- ferent languages, not to mention that the program- ming, so deliberately disguised in database jargon, becomes, unfortunately, inevitable. Yet the lack of data processing capabilities in any database man- agement system necessitates mixing two languages. A simple example should suffice to illustrate the power of adding data processing capabilities for query processing and evaluation. Consider a database defined by a relation R (STUDENT, COURSE, MIDTERM, FINAL), recording stu- dents’ midterm and final examinations’ scores in different courses; an occurrence of R such as (John, CS501, 78.5, 63.5) means that John got 78.5 in the midterm and 63.5 in the final in CS501. Suppose three procedures, AVERAGE, SORT and GRADE, are defined and stored in the system pro- gram library: AVERAGE computes the average of the midterm and final for each student, SORT sorts the averages and GRADE transforms the average into a letter grade based on the sorted average dis- tribution and a grading policy. A query, “FIND GRADE FOR JOHN IN CS501”, as an example, can be processed by imbedding SEQUEL in a PAS- CAL-like language as shown in Fig. 3 (to highlight the essentials of the program, programming details, such as data type definitions, variable initiation, etc., are neglected). Alternatively, a database sys- 27

Design of a virtual database

Embed Size (px)

Citation preview

Inform. Systems Vol. IO, No. I, pp. 27-35, 1985 Printed in the U.S.A.

0306-4379185 $3.00 + .M) 0 1985 Pergamon Press Ltd.

DESIGN OF A VIRTUAL DATABASE

SHELDON SHEN Academic Faculty of Accounting, The Ohio State University, 1775 College Road, Columbus, OH

43210

(Received 25 August 1983)

Abstract-This paper proposes a database management system that automatically executes and in- tegrates data retrieval, information deduction and information processing for query processing. The major thesis is that a user should be allowed to define and access any information related to a data- base-even if the information is not explicitly stored in the database; furthermore, this process of generating new information should be automated in a sense that users do not have to write procedural programs. Thus, the notion of data independence and query processing can be maintained.

1. INTRODUCTION

The evolution of database technology has been in- fluenced and characterized by two important con- cepts: data independence and query processing. Data independence requires no modifications in an application program when a database is restruc- tured; thus, a user is relieved from the details of data representation and, consequently, is able to concentrate on the information content of the data. This concept has traditionally been represented by a database abstraction, as in Fig. 1, where two lev- els of data independence-physical data indepen- dence and logical data independence-are demon- strated. Query processing, on the other hand, elim- inates procedural programs and enables data re- trieval to be automated; the idea, of course, is to free a user from the details of programming.

Recently, attention has been focused on adding inferential or deductive capabilities to databases. These capabilities permit information that is not stored explicitly in a database to be implied. In these systems, the conceptual database contains implied data that do not have underlying physical data to instantiate. This idea is conceptually rep- resented in Fig. 2, where the deductive database, consisting of the actual logical data and the addi- tional implied data, becomes the new conceptual database. To users, no distinction is made between the actual data and the implied data, for the query language can be directly applied to the implied database-even if the implied data do not exist.

Admittedly, the deductive database greatly en- hances the database capabilities. More importantly, data independence and query processing can all be preserved. Yet the deductive database, significant as it is, is not a panacea for all the database appli- cations. Most meaningful information, as discussed below, results from data processing; simple data de- duction sometimes does not provide enough insight for data interpretation.

As an example, a hypothesis tested as true or false is certainly more meaningful than merely a set of data. The former, however, must come from a

statistical testing procedure to analyze and process the latter. Since no hypothesis can be simply de- duced or implied by any set of data, no query can be directly used to establish the hypothesis in any deductive database. As another example, “John got aB in CS501” provides more information than “John had 78.5 in the midterm and 63.5 in the final”. Again, the former, computed from a grade calculation pro- cedure processing all the scores in CS501, cannot be deduced by the latter. Thus, no simple query system exists in current database systems to per- form such information processing.

Solutions for the above example, of course, exist. A user, striving for more information, is told to imbed a query language in other programming lan- guages to perform data processing. Obviously, it is difftcult and awkward to combine two entirely dif- ferent languages, not to mention that the program- ming, so deliberately disguised in database jargon, becomes, unfortunately, inevitable. Yet the lack of data processing capabilities in any database man- agement system necessitates mixing two languages.

A simple example should suffice to illustrate the power of adding data processing capabilities for query processing and evaluation. Consider a database defined by a relation R (STUDENT, COURSE, MIDTERM, FINAL), recording stu- dents’ midterm and final examinations’ scores in different courses; an occurrence of R such as (John, CS501, 78.5, 63.5) means that John got 78.5 in the midterm and 63.5 in the final in CS501. Suppose three procedures, AVERAGE, SORT and GRADE, are defined and stored in the system pro- gram library: AVERAGE computes the average of the midterm and final for each student, SORT sorts the averages and GRADE transforms the average into a letter grade based on the sorted average dis- tribution and a grading policy. A query, “FIND GRADE FOR JOHN IN CS501”, as an example, can be processed by imbedding SEQUEL in a PAS- CAL-like language as shown in Fig. 3 (to highlight the essentials of the program, programming details, such as data type definitions, variable initiation, etc., are neglected). Alternatively, a database sys-

27

28 SELDONSHEN

user's view

user’s view

physical database loglcal database

Fig. 1.

user’s view

tern can be designed such that the answer of the query is obtained without requiring a user to write programs; obviously, we prefer the latter.

In this paper, we propose adding data processing capabilities in a database management system to integrate data retrieval, information deduction and information processing activities. A new concep- tual database, called a virtual database, is pre- sented to users as another level of database ab- straction, as in Fig. 4. A virtual database contains additional virtual data that do not explicitly exist and cannot be implied. A virtual database query processor, basically a problem solver, analyzes a query and forms a solution plan for query pro- cessing. This plan might include retrieving data from a conventional database, performing infor- mation deduction and calling relevant procedures for data processing.

In the following, First Order Logic will be re- viewed in Section 2.1. First Order Logic, as a con- ceptual framework, is more desirable than other programming languages. Programming languages, designed basically for unambiguous parsing, are generally intricate and detailed in a way which ob- scure some of the semantic relations between dif- ferent pieces of a program. Therefore, for semantics investigation purposes, we must be prepared to abandon the absence of ambiguity in favor of con- ceptual clarity by using languages which are less elaborate than that required for the purpose of pars- ing, and such a language is logic. Section 3 de- scribes the virtual database and discusses its spec- ification through a virtual database schema. The difference between a virtual database schema and a user’s logical database is emphasized. Virtual query processing, which is understandably more complex than regular query processing, is dis- cussed in Section 4. And, finally, in section five, conclusions and applications of virtual data- bases are presented.

2. BACKGROUND

2.1 First Order Logic It has been shown that the information deduction

capability of a database requires a database to be

PROC~OUREAVERAGE(flle,file); PROCEOUWSORT(flle,file); PROCEOUREGRAOE(flle,flle); BEGIN

SLETSCORE-RECORD-FILEBE $SELECTSTUOENT,COURSE,NIOTERH,FINAL SFROHR $W+&ECOURSE-CS501

SOPENSCORE-RECORD-FILE; SFETCHSCORE-RECORD-FILE; AVERAGE(SCORE-RECORD-FILE,AVESCORE-RECORO-FILE); SORT(AVESCORE-RECORD-FILE.SORTEO-RECORD-FILE): GRAD~(SORTEO-RECORD-FILE,GRAOE-RECORD-FILE):'. $CREATEGRAOE-RECORD-TABLE SLETGRAOEBE SSELECTGRADE SFRLYIGRADE-RECORD-TABLE SWHERESTUOf!NT-'Jot-in'

SOPENGRAOE

END.

Fig. 3

described by First Order Logic [2, 61. First Order Logic is a class of languages involving the following primitive symbols: variable and constant symbols, function symbols and predicate symbols. A term is a variable symbol, a constant symbol, or an expres- sion of the form h(tl , . . . , t,), where h is an n-ary function symbol and tl , . . . , t, are terms. An atomic formula is an expression of the form P(t, , . . . , t,,), where P is an n-ary predicate symbol and tl, . . . , t, are terms. An atomic formula or the negation of the atomic formula will be referred to as a literal. The legitimate expressions in first order logic are called well-formed-formulas (wffs). Well-formed- formulas are defined using atomic formulas, paren- theses, the logical connectives and the universal quantifier as follows:

(1) atomic formulas are wffs; (2) if A is a wff and x is an individual variable,

then (V x)A is a wff, (3) if A and B are wffs, the (A), (A) + (B) are

wffs; (4) the only wffs are those obtained by finitely

many applications of 1, 2, 3. A wff consisting of a set of disjunctive literals is referred to as a clause. A clause where the number of positive literals is at most one is called a Horn clause. Procedures exist which could convert any wff to a set of clauses [5].

A many-sorted logic is an extension of first order logic where variables belong to different domains. A deductive database contains a set of many-sorted logical expressions without existential quantifiers and function signs [2, 61. Generally, a distinction is made between an intentional database and an ex- tentional database. An intentional database con- sists of first order assertions and rules of inference

,-p+ user's view

user’s view

user's vlew

pnysical aamase logical aatabase aemtlve aatabase

Fig. 2.

Design of a virtual database 29

pnyslcal oatabase logical database virtual catabase

Fig. 4.

without constant symbols for database description purposes. The actual occurrences or instances of an intentional database are called an extentional database. A query is a Horn clause without positive literals. A closed query does not have any variable and calls for a yes or no answer. In contrast, an open query requires a set of objects as answers, which satisfy some conditions stated in the query.

Clearly, the relational database model is a subset of the deductive database. While both models con- tain atomic formulas (called relations in the rela- tional model), the deductive database includes ad- ditional rules of inference. Atomic formulas, conceptually simple as in the relational model, en- able data to be retrieved by non-procedural query language. It is the rules of inference, however, that make information deduction possible. These two database structures are compared by an example shown in Fig. 5.

The GRANDFATHER relation in Fig. 5(a) is called an implied relation because it is not explicitly instantiated. To find out who c’s grandfather is in Fig. 5(b), traditional query processing can be di- rectly applied to the GRANDFATHER relation. In Fig. 5(a), deduction, together with data retrieval, must be used, for there is no data stored in GRANDFATHER. This seemingly inefficient way of query processing, however, has several advan- tages. First, one table, instead of two tables, is enough to store the same amount of information. As a result of this, data redundancy is reduced. Sec- ond, updating in Fig. 5(a) is easier than in Fig. 5(b). For example, suppose the first tuple FATHER(a, 6) is updated to FATHER (a, w). In Fig. 5(a), we simply change FATHER (a, b) to FATHER (a, w); in Fig. 5(b), however, we also have to remember to delete GRANDFATHER (a, c). Otherwise, in- consistency arises. And, finally, deductive capa-

physicaldatabase:

database schema:

GRANDFATHER a c 5 a

datadasescnem:

FATHER(X, Y); FATHER(X,Y); ~. ~~ ~_ FATHER( X, Y), FATHER(Y, Z) --> GRANOFATHER(X, Z); GRANOFAMER( X, 2);

conceptual database: conceptual database:

FATHER(X,Y); FAlHER(X, Y); GRANDFATHER( X, 2). GRANOFAlhER(X,Z).

-deductlvedatabase- -relationaldatabase -

Fig. 5(a). Fig. 5(b).

user’s view

user’s view

user’s view

bility is a necessity in incorporating intelligence into database applications.

2.2 Database schema and conceptual database In the above example, the deductive database is

more powerful than the relational database, for non- existent data can now be implied. More impor- tantly, a user, based on his conceptual view, is never aware of the deductive process, for he cannot distinguish between a true, actual relation and an implied relation. This is made possible by adding rules of inference into, rather than the conceptual database, the intentional database or database schema, as in Fig. 5. Hence, it is important here to distinguish between a database schema and a con- ceptual database. In the relational database, no dis- tinction is necessary, of course, as both of them are represented as a user’s view or logical database. In this paper, however, a conceptual database, con- sisting of a set of tables, is still used to refer to a user’s view of a database; a database schema, on the other hand, is purposedly hidden from users. Thus, a user can maintain the same, simple view about the world data as advocated by the relational database; that is, a world database can always be modeled as a set of tables-even if the underlying logical structure is now not tables in nature, i.e. there are also rules of inference.

The separation of the database schema and the conceptual database provides conceptual simplicity for users as well as database designers, for design- ers can concentrate on the database schema inde- pendently of the conceptual database; furthermore, the same database schema can support multiple conceptual databases. This separation, however, necessitates a front-end processor, called a deduc- tive query analyzer, on the top of the conventional query processor, as in Fig. 6. This query analyzer,

phvslcal datame:

FATHER m

30 SELDONSHEN

- relational dataoase -

Fig. 6.

- oeductlve aatabase -

upon a query request, evaluates the query and calls a conventional query processor if data explicitly exist; otherwise, the analyzer should generate the data for the query. Different implementations are possible [2, 61, and one of them, called PROLOG, will be discussed in the next section. We make a slight notational change (i.e. change + to t) to maintain PROLOG notations; however, no confu- sion should arise.

2.3 Procedural interpretation of logic programs Databases defined through logic can be inter-

preted and implemented in several ways. Of par- ticular intrest is PROLOG (PROgramming language based on LOGic), which interprets logical expres- sions (Horn clauses in particular) procedurally. In this system, no distinction is made between a pro- gram and what is conventionally regarded as tables or files of data. This indicates a very important im- plication; that is, a database may be viewed as a logic program, in which retrieval is automatically taken care of through resolution [3]. The following is a brief discussion of PROLOG.

A clause in PROLOG is written as Bi , . . . , B, + A, . . . , A,, where B,, . . . , B,, Al, . . . , A, are literals with each free variables implicitly univer- sally quantified. If m 5 1, then B,, . . . , B,, c A,, . . . ) A,, is called a Horn clause. PROLOG is pri- marily concerned with resolution on Horn clauses. There are four types of Horn clauses and their pro- cedural interpretations are as follows:

(1)

(2)

(3)

(4)

m = O;n = OdenotedasO. This is the empty clause and is interpreted as the STOP instruction. m = 1, n # 0:B +-A,, . . . , A,. This is a rule of inference. It is viewed as a procedure declaration, where B, the distin- guished literal, is the head of the procedure, and the procedure calls, A,, . . . , A,,, are the body of the procedure. Notice that if A,, . . . , A, are known, then B is known. This is re- ferred to as a bottom-up viewpoint. Alter- natively, the top-down viewpoint maintains that in order to know B, it is necessary to know or solve AI, . . . , A,, first. m = 1, n = 0:B t. This is a Horn clause to assert B. It is also looked upon as a procedure with empty clauses. m = 0, n # O:+A,, . . . ,A,. This is called a goal statement, denoting a query to be answered.

The PROLOG interpreter is invoked when a goal statement is presented. The successive executions to solve a problem are based on the resolution prin- ciple. As an example, consider the following set of Horn clauses:

+ GRANDFATHER(a, Z).

FATHER(a, 6) c

FATHER( b, c) t

GRANDFATHER(X, Z) + FATHER(X, Y), FATHER( Y, Z)

To find “whose grandfather a is”, PROLOG exe- cutes the query “t GRANDFATHER (a, Z)” as follows:

+-GRANDFATHER(a, Z) +FATHER(a, Y), FATHER( Y, Z) +FATHER(a, b), FATHER(b, Z) +FATHER(a, b), FATHER(b, c) +-cl.

The above execution starts from the query state- ment “+ GRANDFATHER(a, Z)“. This repre- sents a procedural call to retrieve suitable data to replace Z. No available data in the GRAND- FATHER relation exists to execute this call. In- stead, a pattern matching routine matches the goal, GRANDFATHER(a, Z), with GRANDFATHER (X, Z), the heading of “GRANDFATHER(X, Z) +- FATHER(X, Y), FATHER( Y, Z)“. During this matching process, the goal is replaced by “FA- THER(a, Y), FATHER( Y, Z)“, effectively invok- ing a new procedure and passing parameters. FA- THER(a, Y) is then executed and the result is the substitution of Y by b, based on the search through the FATHER relation. Again, b is passed to the next procedure and FATHER(b, Z) is executed. Now c is found for Z, and there is no more pro- cedure calls. The empty clause signals “STOP” for the program execution. Notice that a relation, de- fined for the data description, is used here as a pro- cedure invocation; moreover, a user is never aware of the matching and searching-he only states the query to invoke the PROLOG interpreter.

3. VIRTUAL DATABASE

First Order Logic, although powerful in repre- senting a database and performing deductions, is limited in that it can only reveal existing data and their relationships; no procedure is defined for data

Design of a virtual database 31

processing. PROLOG, on the other hand, defines procedures. These procedures, however, are exe- cuted primarily for data retrieval; no data pro- cessing is permitted. Data processing, as mentioned earlier, yields more meaningful data interpretation; yet currently it requires mixing two languages, and the ideas of data independence and query pro- cessing are violated.

In this section, the idea of the virtual database is introduced to overcome these problems. A virtual database is abstracted as consisting of two com- ponents: a conceptual virtual database, as a user’s view, and a virtual database schema, as a logical database structure. A conceptual virtual database is an extention of a relational conceptual database; additionally, it includes, as its name implies, non- existent or virtual data. These virtual data are either implied, through deduction, or created, through data processing, from some existing data. Their presence generally signals a need to have more information, as in the deduction case, or to enhance data interpretation, as in the data pro- cessing case. A user, however, cannot distinguish whether data are retrieved, implied or processed. An example of a conceptual virtual database and its relationships with the relational database and users are shown in Fig. 7. Notice that a user con- ceives his database as consisting of Rl, R2, R3 and R4, in which only R1 is instantiated, R2, R3 and R4 are virtual relations. This database will be discussed later in this section.

A conceptual virtual database is constructed from a virtual database schema in which relation- ships among relations in the conceptual virtual database, such as many-to-many, implication and data processing, are defined. A virtual database schema is defined as a set of database formulas which are either Horn clauses or processing clauses. Horn clauses, as discussed previously, de- scribe the many-to-many relationship and the im- plication relationships among relations. Processing clauses, on the other hand, are used to establish processing relationships. Syntactically, a pro- cessing clause is defined as an expression Ri -f--f Rj, where Ri, Rj are relations and f is a processing procedure or program. (To differentiate the impli- cation relationship -+ from the data-processing re- lationship -f +, the procedure is placed in be- tween the implication symbol.) Semantically, its interpretation is based on a concept in logic called a model. A model of a relation R, in our framework, is a set of data which instantiate R and make R true. Models of relations in a virtual database, of course,

user’s view user’s view user’s view

ru(srwEwr,couasE,nrorEsl.FIWru) Pl S~UDEYT.C(IIPSE.nlOTE~,F~Na) I lu sruoEwr,ccuas.sccm)

IU~fTUDEnT.c(XIRsE.Sccuf R4(fTUDEMT,CWR_$E.CRWE 1

relational database conceptual virtual clat.mase

Fig. 7.

need not necessarily exist explicitly; for example. in Fig. 5(a), FATHER(X, Y) has a mode1 while GRANDFATHER(X, Z) does not. The purpose to have a program in a virtual database, consequently, is to create models. Thus, Ri -f-+ Rj means that fis executed on the mode1 of Ri to generate a model of Rj, provided that the mode1 of Ri exists. Notice that a processing clause is not an Artificial Intelli- gence production rule [4, 51. In AI, a production rule such as Ri -+ f Rj would be interpreted as if Ri is true, f is executed with the result Rj true, i.e. f’ transforms predicate Ri into predicate Rj; f does not operate on the mode1 of Ri. In other words, there is no concept of models in AI production rules.

A processing clause relates two relations in a con- ceptual virtual database by a procedure and allows a virtual relation to be instantiated for query pro- cessing. As an example, consider the following database schema, using the three widely used pro- cedures in the relational model-JOIN, PROJECT and SELECT:

Rl(A, B); R2(B, C);

Rl(A, B), R2(B, C) + R3(A, B. C); R3(A, B, &PROJECT+ R4(A, C);

R4(A, &SELECT-+ R4(a, C).

Suppose only RI(A, B) and R2(B, C) are instan- tiated initially, and a query is made to find occur- rences of C for A = a. The above database schema can be interpreted to create a model for R4(a, C), as follows:

(1) JOIN or imply from Rl(A, B) and R2(C, D) to yield a relation R3(A, B, C),

(2) PROJECT R3(A, B, C) to R4(A, C), and (3) SELECT occurrences in P4(A, C) where A

= a. In this formulation, Rl and R2 are the only relations that are explicitly instantiated in the database. The occurrences of other relations, as a matter of fact, are created by the programs upon a query request. This approach certainly reduces data redundancies, and therefore, avoids many updating and consis- tency problems. As another example, consider the following database about students’ grades, dis- cussed in the introductory section:

Rl(STUDENT, COURSE, MIDTERM, FINAL); Rl(STUDENT, COURSE, MIDTERM, FINAL) -AVERAGE+ RZ(STUDENT, COURSE, SCORE); R2(STUDENT, COURSE, SCORE) -SORT+ R3(STUDENT, COURSE, SCORE); R3(STUDENT, COURSE, SCORE) -GRADE-+ R4(STUDENT, COURSE, GRADE).

In this example, only the mode1 of Rl is stored in the database. In response to a query for student’s grades, AVERAGE first computes the average score for each student and creates a model for R2. SORT then processes the mode1 of R2 and gener- ates a mode1 for R3. GRADE, at last, determines

32 SELD~NSHEN

student’s grades and comes up with a model for R4. Notice that this database will not be jeopardized, even if, as an example, John actually had 58.3, rather than 78.5, in the final of CS501, for we can simply change John’s final in Rl. In contrast, if Rl through R4 are all instantiated, as in a database without data processing capability, updating to maintain data integrity certainly is not an easy task. Furthermore, we can easily modify procedures to reflect policy changes; for example, change the sim- ple average to the weighted average and change the grading policy from one-third A’s to no A’s in a course. Again, the database remains intact.

Given the above discussions, we can formalize a virtual database, as follows:

Definition: A virtual database is a triple (R, T, S), where R is a nonempty set of relations, T, is a nonempty set of implication relationships, denoted as +‘s, and data-processing relation- ships, denoted as -f -+‘s, where f is a pro- cedure; S is a set of mapping represented as either Ri, ... Rj -+ Rk or Ri -f + RJ’, where Ri, Rj and Rk C R. R is called a conceptual virtual database and S is referred to as a virtual database schema.

4. VIRTUAL QUERY PROCESSING

In the previous sections, the virtual database is formalized to integrate data, programs and their re- lationships. A virtual database, using virtual rela- tions, broadens a user’s conceptual view of a da- tabase without costly data redundancies; thus, updating a database is simplified and data integrity can be enhanced. Moreover, a virtual database sys- tem enables information processing to be auto- mated, i.e. a user does not need to write programs for accessing virtual data. Automating information processing requires a database system to be capable of planning and coordinating data retrieval, deduc- tion and program execution activities. The planning and coordinating of virtual query processing are dis- cussed in this section, as query evaluation and query execution, respectively.

4.1 Query evaluation A query, based on a user’s conceptual database,

needs to be evaluated not only for its syntactic cor- rectness but also, more importantly, for its seman- tic interpretation. The semantic interpretation of a query is a solution plan for processing that query and can be expressed in several ways. As an ex- ample, in the relational model, a solution plan is represented as sequences of JOIN, PROJECT and SELECT operations, as these are basic procedures for processing a relational query. In PROLOG, a solution plan includes successive matching and re- placement processes and, in other database models, a solution plan might consist of a set of DBMS com-

mands. Obviously, a solution plan for a virtual query is more complex, as all the data processing, data retrieval and data deduction activities need to be considered and coordinated. As a matter of fact, the solution plan of a virtual query is realized by a tree structured scheme called a query tree.

Definition: A query tree for a query Q in a virtual database (R, T, S) is a triple (N, T’, Q). N is a nonempty set of nodes where each node is a nonempty subset of R, and if nl and n2 are different nodes, then nl II n2 = 0; T’ is a non- empty set of implication and data-processing relationships, T’ C T, one node is singled out as the root Q, and each node has associated with it one natural number (its rank) satisfying:

(a) the rank of Q alone is 1; (b) the nodes of rank k are those nodes which bear one element of T to exactly one element of a node with rank k - 1, where k > 1.

Conditions (b) rules out circular paths, and also means that each node besides Q bears an element of T to something or other. This definition is dif- ferent from the definition of a tree in two ways: first, T’ contains, rather than a single binary rela- tionship, a set of implication and data-processing relationships, and second, a node is related to an element of another node instead of another node.

Initially, a query, just like a query in PROLOG, is expressed as an uninstantiated relation and is in- terpreted as a procedure, calling for a model in- stantiation. If a model can be found in the database, then the query is processed in a way similar to the conventional query processing. If no model exists, the query processor searches through the virtual database schema and starts to develop a query tree for the query. The procedure to derive such a tree is explained below:

(1) Set the initial query as the root relation Q with rank 1;

(2) If Ri, . . . , Rj -+ Rk in the database schema and Rk is in a node with rank k in the query tree, then the rank of Ri, . . . ,Rjissetask + 1,andeach of Ri, . . . , Rj is related to Rk by + in the query tree;

(3) If Ri -f+ Rj in the database scheme and Rj is in a node with rank j in the query tree, then the rank of Ri is set as j + 1, and Ri is related to Rj by -f + in the query tree;

(4) GOT0 2; (5) STOP.

An example of a query tree for a query made against R7 is shown in Fig. 8(b), given a database schema in Fig. 8(a). In Fig. (8), Rl through RS are the in- stantiated relations, all others are virtual relations. In the example, a query which requires a model of Rll is requested. Failing to instantiate Rll, the query processor sets Rll as the root of the query tree and looks for Rll’s descendents in the data-

Design of a virtual database 33

Rl; Rll

R2; f R3; R4; (R9,kO)

RS; P '; Rl,RZ --> R6;

R3 -f-> R7; (RS,'R~) (R;R~)

R4 -9-s R8; C? RS,R6 --> R9;

?

R7.R8 --> RlO; I f ?

R9,RlO --> Rll. (Rl,RZ) R3 R4

avirtualdacabaseschema aquerytreeforR11

Fig. 8(a). Fig. 8(b).

base schema. R9 and RlO are located and they are related to Rl 1 by -+. R9 and RIO do not have models either, and they, in turn, become the new query and the above procedure repeats itself again. The procedure stops when no more replacements can be found. If all the leaves or terminating nodes in the query tree are instantiated relations, then a solution plan is derived; otherwise, the query does not have an answer. In Fig. 8, Rl through R.5 are terminating nodes and, therefore, Rl 1 has a model from the database.

A query tree demonstrates how a model for a query can be obtained. As an example, in Fig. 8(b), to create a model for R 11, the query processor first retrieves the models for Rl through R5. Procedures fand g are then called and executed to generate the models for R7 and R8, respectively. The model of R6 is inferred by the deductive processor from the models of Rl and R2, and similarly, the model of R9 and RlO are from the models of R5, R6 and R7, R8. Finally, the model of Rll is established from the models of R9 and RlO, again, by the deductive processor.

4.2 Query execution A query tree, derived from a database schema,

represents a solution plan for processing a query. Since there might be several programs and many deduction processes involved in query processing, and their relative execution speeds are unknown, a coordination scheme must be devised to determine when to execute a program. In this section, a con- ceptual framework, called the program library of a query tree, is discussed for program coordination and execution.

Definition: Let A be a finite set of programs, called a set of “base programs” in a query tree. The program library PL of a query tree is de- fined as follows:

(I) every base program is in PL; (2) for every s1 and p in PL, 0; p is in PL

(do fl then do p); (3) for every CI and p in PL, finl3 is in PL

(0 and p are executed parallelly).

The purpose of a program library for a query tree is to synthesize base programs into a set of new programs that represents permissible execution se-

quences. For example, in Fig. 8(b), there are six base programs, two procedures f and g and four deductive processes, represented in the query tree as either an implication or a data-processing rela- tionship. Obviously, these base programs cannot be arbitrarily selected for execution; particularly, they need to coordinate themselves, as some programs must wait for the results from other programs. A program library is a set of programs created to rep- resent these execution sequences. This is achieved by concatenating base programs, as is defined in 2. and by synthesizing parallel base programs, as is defined in 3. Thus, a program library, when estab- lished, represents a set of coordinated or synchro- nized programs to process a query. An example of a program in PL is w;xny;z, to be interpreted as: M‘ is followed by two simultaneously executed pro- grams, x and y, and xny is followed by z, where w,x,y,z are base programs. Notice that the recur- siveness in the definition allows all the possible combinations of base programs to be considered as candidates in a program library. In practice, of course, we are only interested in programs with ex- ecutable sequences. Procedures to generate exe- cutable programs in a program library are discussed below.

To see how programs can be synthesized from a query tree, let’s observe that the execution of a base program must be accompanied by the presence of a model for the corresponding node in a query tree. For example, in Fig. 8(b), a model for R3 must exist before f can be executed. Thus, each node can be associated with a condition, denoted by the lower- case symbol from that node, to signal the presence of a model for program execution. For example, r3. when it is set true, means fcan be executed. Sim- ilarly, the conditions for executing four deductive processes are (rl, r2), (r5, r6), (r7, r8) and (r9, r10). respectively. Using these conditions, nonsensical program execution can be prevented. Thus, the im- plication process associated with RI and R2 takes place only if (rl, r2) is true. The result of this ex- ecution sets r6 true and, in turn, triggers another implication process if r5 is also true. No program execution in a program library is allowed unless the corresponding condition is set true. Since there is a one-to-one correspondence between a base pro- gram and its associated condition, we can simplify notations by using these conditions to denote the base programs in a program library and synthe- sizing them to describe a program library. As an ex- ample, the six base programs in Fig. 8(b) can be represented as (rl, r2), r3, r4, (r5, r6), (r7, r8) and (r9, r10). The next task, then, is to synthesize these base programs into the program library, using a con- cept called a branch of a query tree.

Definition: A sequence of nodes (... ; nk; . . . ; n,) is a branch of (R, T, Q) iff

(1) no = Q; (2) nitni-, for i = 2, 3, . . . , where t is an

34 SELD~NSHEN

element of T and nitni_ I means that ni is related to exactly one element of Iti_, .

(3) if there is an x such that xtni, then the sequence has an (i + 1)th element.

A branch can be easily identified by starting from the root node and going down the implication or data-processing relationships in a query tree to a terminating node; then, the order of nodes obtained is reversed. For example, in Fig. 8(b), the (R3; (R7, R8); (R9, RIO); Rll) branch is obtained by revers- ing Rll +- (R9, RIO) + (R7, R8) tf-R3. Simi- larly, there are two other branches in Fig. 8(b): ((Rl, R2); (R5, R6); (R9, RlO); Rll) and (R4; (R7, R8); (R9, RlO); Rll).

Each branch indicates the order for executing base programs. Using the conditions to denote as- sociated base programs, we can convert a branch into a concatenated program in the program library by changing upper-case node symbols to lower-case condition symbols. As an example, the three con- catenated programs in Fig. 8(b) are:

(1) (rl, r2); (r5, r6); (r9, r10); rll (2) r3; (r7, r8); (r9, rl0); rll and (3) r4; (r7, r8); (r-9, rl0); rll. In summary, the concatenated programs in a pro-

gram library are obtained as follows: (1) Identify the branches in a query tree; (2) Substitute condition symbols into node sym-

bols in each branch. Having concatenated base programs, we can now synthesize them into parallel executed programs, as follows:

(1) If a; . . . 6; X; . . . and c; . . . ; d; x; . . . are two concatenated programs in a program library, then they are synthesized into (a; . . . ; b)~(c; . . . ; d); x; . . . ) meaning that (a; . . . , b) and (c; . . . , d) are ex-

ecuted in a parallel manner and they are followed by x when their executions are completed.

(2) GOT0 1; (3) STOP.

The output from this procedure is a program in PL that maintains sequences of base programs to pro- cess a query. Again, using Fig. 8(b) as an example, the concatenated programs are synthesized as [r3ar4; (r7, r8)]n[(rl, r2); (r5, r6)]; (19, r10); rll. This is interpreted as follows: six base programs are executed to process a query; among them r3 and r4 are executed in a parallel manner and followed by (r7, r8); simultaneously executed with [r3m4; (r7, r8)l is [(rl, r2); (t-5, t-6)] which is composed of two successive base programs: (rl, r2) followed by (r5, t-6); the last program executed is (r9, r10) which creates a model for R 11.

The conditions in a query tree, as discussed above, allow program execution to be coordinated. One interesting side effect of using these conditions is that program execution can be saved, and it is not necessary to repeat the same program execution for processing a similar query. For example, in Fig. 8(b), once a model of Rll is created and saved, rl

through rll will remain true. Therefore, the next similar query about Rll can be directly executed by data retrieval alone; there is no need to repeat every program execution. Part of these conditions would become false, however, if some data in the database are updated. For example, r6, r9 and rl 1 become false if data in Rl are updated, for R6, R9 and RI 1 are in the same branch of Rl . In this case, a suitable program synthesized in PL to execute a query about Rll is (rl, r2); (r5, r-6); (r9, r10); rll. Since rl0 is not affected by rl and remains true when Rl is updated, r3m4; (r7, r8) does not have to be executed. The details of maintaining a set of consistent conditions are certainly an important im- plementation consideration. They will not be dis- cussed, however, in this paper.

5. CONCLUSION

This paper has proposed design of a virtual database (1) to broaden users’ views about a da- tabase and (2) to enhance better interpretation of data. The major thesis is that a user should be al- lowed to define and access any information related to a database-even if the information is not ex- plicitly stored in the database; furthermore, this process of generating new information should be automated in a sense that users do not have to write programs, maintaining the notion of data independ- ence and query processing.

One possible application of this framework would be design of a decision supporting system where decisions are represented by virtual relations. De- cision making, which must be based on some facts and a decision model, is then considered as a pro- cess to instantiate a virtual relation, involving data retrieval, data implication and processing. As an example, consider an academic DSS to make de- cisions about students’ grades. The decision, made by a grading policy and students’ scores, is the re- sult of instantiating a virtual relation of students’ grades, and thus, as demonstrated in the paper, can be supported by a virtual database. The other ad- vantage which can be exploited is that a decision, modeled as a virtual relation, can now easily re- spond to environmental changes, as updating data or decision policies automatically results in a new decision. Obviously, this is an area in need of fur- ther research.

111

121

131

REFERENCES

E. F. Codd: A relational model for large shared data banks. Comm. ACM 13(6), 377-387 (June 1970). C. L. Chang: DEDUCE 2: further investigations on deduction in relational data bases. In: Logic andData Bases (Edited by H. Gallaire and J. Minker), pp. 201- 236. Plenum Press, New York (1978). I. Futo, F. Darvas and P. Szeredi: The application of PROLOG to the development of QA and DBM systems. In: Logic and Databases (Edited by H. Gal- laire and J. Minker), pp 347-376. Plenum Press, New York (1978).

Design of a virtual database 35

[41 C. Holsapple, S. Shen and A. Whinston: A consulting J. Minker), pp. 149-177. Plenum Press, New York system for data base design. Information Systems I, (1978). 281-296 (1982). [7] J. A. Robinson: Machine oriented logic based on res-

151 N. J. Nilsson: Principles of Artificial Intelligence. olution principle. JACM 12, 23-44 (1965). Tioga Publishing, Palo Alto, CA (1979). [8] S. Shen and A. Dutta, An Extended Database with

[61 R. Reiter: Deductive Q-A on relational data bases. Data Processing Capabilities, Proc. 13th Annual In: Logic andData Bases (Edited by H. Gallaire and Pittsburgh Conf., April 1982, pp. 559-564 (1982).