1

Data Management: Databases and Organizations

Richard Watson

Summary of Selections from Chapters 9, 10 prepared by Kirk Scott

2

Chapter 9, The Relational Model and Relational Algebra

• Generally speaking, the contents of this chapter should not be too difficult

• The idea is that most of the information has been introduced inductively in the foregoing sections

• This chapter puts some of the earlier information into context and sums up the idea of relational databases

3

Background

• Databases existed before the development of the relational model

• They were based on networks or hierarchies

• In other words, their implementation was based on linked data structures

• These kinds of databases were not easy to understand or code

4

• The general idea of a relational model was familiar, but it was unclear whether it was a useful alternative.

• The main apparent problem was performance.

• Linked code can run quickly.

• Performing joins, for example, by traversing two tables is not very efficient.

5

• E. F. Codd is recognized as the main figure in the development of the relational model as a practical alternative to existing dbms’s.

• One of the things needed in order to make the relational model practical was the development of efficient algorithms for performing operations.

• Equally important, as it turns out, was the development of the theory so that it was clear what a database was all about

6

• Codd made the following observations about existing systems:

• 1. They forced programmers to write low level code

• This meant that queries were more difficult to write, took longer to write, and typically required debugging because they were error prone.

7

• 2. No commands were available for processing multiple records at a time.

• Although efficient algorithms were needed before the relational model could be adopted, the problems with existing algorithms were clear.

• They consisted of using procedural coding statements like loops in order to traverse linked data structures.

• By definition, inside the loop one record at a time was accessed.

• The relational model was inherently set-based and suitable for the implementation of set level commands.

8

• 3. The existing systems were not amenable to ad hoc querying.

• Trained programmers are needed in order to write procedural code.

• SQL is simple enough that an end user can learn it (maybe).

• Also, the development time for an SQL query is short enough that it becomes practical to write one-time queries, not suites of programs.

9

• In addition to the three observations about existing systems and the contrasts with the relational model already listed, Codd was interested in achieving three goals.

10

• 1. Data independence

• The users of databases should not have to worry about how the data was physically stored.

• They should be free to envision the data simply as a collection of related tables, regardless of the physical implementation.

• Any physical level questions would be at the operating system or database administrator level.

11

• 2. Communicability

• The basic idea here is that the relational model, based on tables, records, keys, and values, is relatively easily understood by both users and programmers, making it easier for clients and developers to work together.

• This is in marked contrast to earlier database models.

12

• 3. Set processing

• This is basically just a repetition of information given above.

• The beauty of the relational model is that it allows queries to be non-procedural and still supports the retrieval of multiple records.

• The model is “tell what you want” rather than “tell how to get it”.
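The contrast between record-at-a-time code and a set-level query can be sketched as follows. This is only an illustration: the `records` data and the `Pairing` table are hypothetical, and Python's built-in sqlite3 module stands in for whatever dbms is actually in use.

```python
import sqlite3

# Hypothetical (xid, zid) records, invented for this illustration.
records = [("a", "r"), ("a", "s"), ("b", "r"), ("c", "s")]

# "Tell how to get it": procedural code visits one record at a time in a loop.
procedural = []
for xid, zid in records:
    if zid == "r":
        procedural.append(xid)

# "Tell what you want": one set-level statement; the dbms chooses how to run it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Pairing (xid TEXT, zid TEXT)")
conn.executemany("INSERT INTO Pairing VALUES (?, ?)", records)
declarative = [row[0] for row in conn.execute(
    "SELECT xid FROM Pairing WHERE zid = 'r' ORDER BY xid")]

print(procedural, declarative)
```

Both approaches return the same xid values; the difference is that the SQL statement says nothing about loops or access paths.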

13

The Major Components of the Relational Model

• The relational data model has three major components:

• Data structures

• Integrity rules

• Operators used to retrieve, derive, or modify data

14

Data Structures

• The following terms summarize the data structures that the relational model is based on:

• Domain

• Relation

• Relational database

• Primary key

• Candidate key

• Alternate key

• Foreign key

15

Integrity Rules

• These are the integrity rules of the relational model:

• Entity integrity

• Referential integrity

16

Manipulation Languages

• The operators for manipulating the data in a database are embodied in DDL (data definition language) and DML (data manipulation language)

• There are essentially four options when it comes to relational databases:

• SQL

• QBE

• Relational algebra

• Relational calculus

17

Relational Algebra

• The names of the languages relational algebra and relational calculus emphasize the mathematical underpinnings of the relational model

• Relational calculus will not be pursued at all

18

• Relational algebra will be pursued for two reasons:

• It provides a useful vocabulary for talking about queries

• Even without delving into the theory, it is possible to make some useful observations about the necessary contents of a query language based on relational algebra concepts

19

• Relational algebra is fundamentally based on 8 operations:

• Restrict (select): This picks a subset of rows from a table

• Project: This picks a subset of columns from a table

• Product: This forms all possible pairings of the rows of two tables

20

• Union: This forms a vertical combination of the rows of two tables

• Intersect: This finds the rows that appear in both of two tables

• Difference: This finds the rows that appear in one table but not another

• Join: This finds a subset of rows of a product, typically where corresponding field values match

21

• Divide:

• Relational divide is not as simple as the other concepts and has not been explained yet

• For the sake of completeness it will be explained in the following overheads

• After explaining division, the discussion will return to relational concepts in general

22

Relational Division

• The plan for this section is to explain relational division with the help of a few examples.

• These examples are actually the last four questions on the assignment for this unit.

• The answers to these questions will be given here as part of the explanation.

23

• If you choose to do the assignment, your goal should not be to copy the answers given.

• Instead, after having read the explanatory material, hopefully enough of it will stick in your memory that you can come up with the correct answer on your own.

• If not, you can refer back to the explanations again.

24

• TableX, TableY, and TableZ are given for the questions/examples.

• They are shown on the following overhead.

25

TableX

attribute: xid attribute: xone

a g

b h

c i

d i

TableY

attribute: xid attribute: zid

a r

a s

a t

b r

b s

c r

d r

d s

d t

TableZ

attribute: zid attribute: zone

r l

s m

t n

26

• Strictly speaking, only two tables are needed in order to do division.

• Specifically, in this example only TableY and TableZ are needed.

• In other words, we are specifically interested in the quotient of TableY and TableZ, that is,

• TableY DIVIDED BY TableZ.

27

• TableX is included in the example because it may help visualize the relationship between TableY and TableZ.

• TableY is like the table in the middle between TableX and TableZ.

• In other words, TableY is a subset of a product of TableX and TableZ

• Dividing TableY by TableZ won’t yield TableX—it will yield a subset of TableX

28

• Part of the goal of this discussion is also to show how relational division can be accomplished in SQL.

• As it turns out, division in SQL is done by means of double NOT EXISTS queries, like the formation of FOR ALL queries.

• Because the familiar structure of such queries involves double nesting with three tables, it is convenient to also have TableX available to work with.

29

• Now consider TableY and TableZ.

• The first column of TableY is the field xid.

• TableY and TableZ have the field zid in common.

• The second field of TableZ, field zone, does not play a role in the division.

• The division of the two tables is based on the common field, zid.

• The result of the division will be in terms of the first field in TableY, xid.

30

• The definition of relational division can be explained using these two tables as an example.

• The verbal expression of what TableY DIVIDED BY TableZ is supposed to produce as a result is this:

• It should find all of those values of xid, the first field in TableY, where those values of xid are matched with every value of zid, the common field, that appears in TableZ.

31

• The verbal expression can be restated in this way:

• The division of the two tables should find those values of xid in TableY that are in a Cartesian product with the values of zid in TableZ.

• The division operation will not include in the results any values of xid in TableY that are not matched with every value of zid in TableZ.

32

• The table that would result from a division query that divided TableY by TableZ on the fields TableY.zid and TableZ.zid, respectively, would consist of one column containing xid values taken from TableY.

• TableX, TableY, and TableZ are repeated on the next overhead.

• The result of dividing TableY by TableZ is shown on the overhead following that one.

33

TableX

attribute: xid attribute: xone

a g

b h

c i

d i

TableY

attribute: xid attribute: zid

a r

a s

a t

b r

b s

c r

d r

d s

d t

TableZ

attribute: zid attribute: zone

r l

s m

t n

34

Result of TableY DIVIDED BY TableZ

attribute: xid

a

d
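The verbal definition of division can be checked with a short sketch over plain Python sets. The function name `divide` and the variable names are assumptions made for this illustration; only the data comes from TableY and TableZ.

```python
# (xid, zid) pairs from TableY and the zid values from TableZ.
table_y = {("a", "r"), ("a", "s"), ("a", "t"),
           ("b", "r"), ("b", "s"),
           ("c", "r"),
           ("d", "r"), ("d", "s"), ("d", "t")}
table_z_zid = {"r", "s", "t"}


def divide(dividend, divisor):
    """Return the x values that are paired in dividend with every divisor value."""
    xs = {x for x, _ in dividend}
    return {x for x in xs if all((x, z) in dividend for z in divisor)}


quotient = divide(table_y, table_z_zid)
print(sorted(quotient))  # ['a', 'd']
```

Only a and d are matched with all of r, s, and t, so b and c are left out of the result.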

35

• In order to help the idea stick, another example is explained here verbally without completely illustrating it with tables.

• Suppose some TableR was the full Cartesian product of the xid values in TableX and the zid values in TableZ.

• What would the result be of dividing TableR by TableZ on their common field zid?

36

• Except for the fact that it's stated verbally rather than completely illustrated, this question is easier than the first one.

• In this example TableR replaces TableY.

• If TableR is the Cartesian product of TableX.xid and TableZ.zid, then every xid value in TableR will be in the result of TableR divided by TableZ.

• In other words, the actual results of the division would be the table shown on the next overhead.

37

Result of TableR DIVIDED BY TableZ

attribute: xid

a

b

c

d

38

• Relational division is by nature a binary operation.

• Using SQL syntax, relational division can be accomplished with double NOT EXISTS.

• It turns out that double NOT EXISTS on three different tables is easier to keep track of than double NOT EXISTS on two tables, where one table appears once and the other table appears twice in the query.

• That is one reason why three tables were given for the purposes of these explanations.

39

• The next task is to come up with an SQL query that will find those TableX.xid values that are paired in TableY with all of the existing TableZ.zid values.

• In other words, find those TableX.xid values where there does not exist a TableZ.zid value that it's not matched with in TableY.

• Notice that the desired results can be phrased as "for all" or as a double negation.

40

• This is the indication that in SQL the desired result can be obtained with a double NOT EXISTS query.

• If this query is written correctly, the result set of TableX.xid values will equal the set of TableY.xid values that would result from dividing TableY by TableZ on the fields TableY.zid and TableZ.zid, respectively.

• The desired query is shown on the next overhead.

41

SELECT xid
FROM TableX
WHERE NOT EXISTS
   (SELECT *
    FROM TableZ
    WHERE NOT EXISTS
       (SELECT *
        FROM TableY
        WHERE TableX.xid = TableY.xid
        AND TableY.zid = TableZ.zid));

42

• Phrased informally, as was done in the unit that covered the double not exists queries, this query asks for those values of xid in TableX where there is not a zid value in TableZ that it's not matched with, through the table in the middle, TableY.

43

• Notice that this query follows the pattern for double NOT EXISTS queries where

• the first query opens the left base table,

• the second query opens the other base table,

• and the third query opens the table in the middle.

• For reasons of scope, both of the joining conditions are in the third query.

44

• It is also possible to write such a query using just the two tables that are involved in the division.

• When considering the double NOT EXISTS query, an example was given where all of the relevant fields were in the table in the middle, and it could be opened three times with aliases in order to achieve the desired results.

45

• In the division example the table in the middle, TableY, is both the thing that is being divided (the dividend) and the thing that has the result field in it (the quotient).

• TableZ is the thing you're dividing by (the divisor).

46

• Using the terms for division as the aliases, TableY can be substituted for TableX in the previous example.

• This is possible because the result field of interest is xid, which is in TableY as well as TableX.

• The desired query is shown on the next overhead.

47

SELECT DISTINCT xid
FROM TableY AS Quotient
WHERE NOT EXISTS
   (SELECT *
    FROM TableZ AS Divisor
    WHERE NOT EXISTS
       (SELECT *
        FROM TableY AS Dividend
        WHERE Quotient.xid = Dividend.xid
        AND Dividend.zid = Divisor.zid));
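To confirm that the three-table and two-table forms of the division query agree, both can be run against the example tables. The sketch below assumes SQLite (through Python's sqlite3 module) as the dbms; the column types are guesses, since the overheads give only the values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TableX (xid TEXT, xone TEXT);
CREATE TABLE TableY (xid TEXT, zid TEXT);
CREATE TABLE TableZ (zid TEXT, zone TEXT);
INSERT INTO TableX VALUES ('a','g'),('b','h'),('c','i'),('d','i');
INSERT INTO TableY VALUES ('a','r'),('a','s'),('a','t'),('b','r'),('b','s'),
                          ('c','r'),('d','r'),('d','s'),('d','t');
INSERT INTO TableZ VALUES ('r','l'),('s','m'),('t','n');
""")

# Three tables: TableX supplies the candidate xid values.
three_table = """
SELECT xid FROM TableX
WHERE NOT EXISTS
  (SELECT * FROM TableZ
   WHERE NOT EXISTS
     (SELECT * FROM TableY
      WHERE TableX.xid = TableY.xid AND TableY.zid = TableZ.zid))
"""

# Two tables: TableY is opened twice under different aliases.
two_table = """
SELECT DISTINCT xid FROM TableY AS Quotient
WHERE NOT EXISTS
  (SELECT * FROM TableZ AS Divisor
   WHERE NOT EXISTS
     (SELECT * FROM TableY AS Dividend
      WHERE Quotient.xid = Dividend.xid AND Dividend.zid = Divisor.zid))
"""

three_rows = sorted(row[0] for row in conn.execute(three_table))
two_rows = sorted(row[0] for row in conn.execute(two_table))
print(three_rows, two_rows)  # ['a', 'd'] ['a', 'd']
```

DISTINCT matters in the two-table form because each qualifying xid appears in TableY once per matching zid.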

48

• Finally, in review, what does the division operation have to do with the Cartesian product?

• In other words, in what way can division and product be viewed as complementary operations in a relational system?

• If TableY were the full Cartesian product of the xid from TableX and the zid from TableZ, then TableY divided by TableZ would return all of the xid values in TableX.

• In this case it's clear that these two operations are actually inverses.

49

• The special case of the first example that was used to illustrate division is actually the more common case.

• TableY is not a full Cartesian product of TableX and TableZ.

• Only some of the values of xid have been matched with all of the values of zid.

• Division is also defined in this case, as explained above.

50

• In essence, division finds the inverse for any values that could or would have been the result of a Cartesian product.

• Relational division ignores those values that did not participate in a full Cartesian product.

51

• As you may already have noted, relational algebra is not the same as arithmetic algebra.

• If it were, we would be working with numbers, not relations.

• It seems that in the special case, which is the common case, relational division is not a full inverse.

• However, there is another way of viewing this.

52

• When doing integer division, there is a remainder.

• In a sense, when doing relational division there is also a remainder.

• Those values in TableY which did not participate in a Cartesian product are left over

• Those values are in some sense the remainder upon relational division.
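The inverse relationship and the idea of a remainder can be sketched with Python sets. The names `divide`, `product`, and `remainder` are assumptions made for this illustration.

```python
def divide(dividend, divisor):
    """Return the x values that are paired in dividend with every divisor value."""
    xs = {x for x, _ in dividend}
    return {x for x in xs if all((x, z) in dividend for z in divisor)}


table_x_xid = {"a", "b", "c", "d"}
table_z_zid = {"r", "s", "t"}
table_y = {("a", "r"), ("a", "s"), ("a", "t"), ("b", "r"), ("b", "s"),
           ("c", "r"), ("d", "r"), ("d", "s"), ("d", "t")}

# Dividing a full Cartesian product by one factor gives back the other factor.
product = {(x, z) for x in table_x_xid for z in table_z_zid}
full_quotient = divide(product, table_z_zid)

# For TableY only a and d survive; the rows for b and c are the "remainder".
quotient = divide(table_y, table_z_zid)
remainder = {(x, z) for (x, z) in table_y if x not in quotient}

print(sorted(full_quotient), sorted(quotient), sorted(remainder))
```

Here the remainder rows are (b, r), (b, s), and (c, r): the pairings that did not participate in a full product with the zid values.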

53

• For those interested in things mathematical and logical, it is interesting that the SQL syntax for implementing relational division is the same syntax for implementing the logical quantifier FOR ALL.

• Pursuing an explanation of this aspect of the situation is beyond the scope of these notes.

54

Relational Algebra

• This, then, is the full list of the eight relational algebra operations:

• Restrict (Select)

• Project

• Product

• Union

• Intersect

• Difference

• Join

• Divide

55

A Primitive Set of Relational Operators

• The truth is that there are only five basic relational operations:

• Restrict

• Project

• Product

• Union

• Difference

56

• The five basic operations are basic for the following reason:

• They cannot be defined in terms of any of the other basic operations

• Put another way, the effects they achieve cannot be achieved using any other combination of basic operations

57

• The assertion that the five basic operations are in fact basic will not be demonstrated.

• However, for those who are interested in the question, the following can be noted:

• The five basic operations can be viewed as corresponding to basic operations in a simple algebraic system.

• To a mathematician, the “basicness” of the operations would not be in doubt.

58

• Conversely, if those five are basic, then join, intersection, and division are not basic.

• Showing that these three can be defined in terms of the other five will be pursued.

59

• It is relatively easy to define a join in terms of the basic operations.

• It is a Cartesian product followed by a restriction and a projection
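A join built from product, restriction, and projection can be compared with SQL's own join syntax. This sketch assumes SQLite through Python's sqlite3 module and reuses TableX and TableY from the division examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TableX (xid TEXT, xone TEXT);
CREATE TABLE TableY (xid TEXT, zid TEXT);
INSERT INTO TableX VALUES ('a','g'),('b','h'),('c','i'),('d','i');
INSERT INTO TableY VALUES ('a','r'),('a','s'),('a','t'),('b','r'),('b','s'),
                          ('c','r'),('d','r'),('d','s'),('d','t');
""")

# Join written with dedicated join syntax.
joined = set(conn.execute("""
SELECT TableX.xid, xone, zid
FROM TableX JOIN TableY ON TableX.xid = TableY.xid
"""))

# The same result from the basic operations:
# product (CROSS JOIN), restriction (WHERE), projection (the SELECT list).
derived = set(conn.execute("""
SELECT TableX.xid, xone, zid
FROM TableX CROSS JOIN TableY
WHERE TableX.xid = TableY.xid
"""))

print(joined == derived, len(joined))  # True 9
```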

60

• The fact that intersection is not basic can be illustrated with the help of some Venn diagrams.

61

• A intersect B = (A union B) – (A – B) – (B – A)

[Venn diagram: overlapping sets A and B, with the regions A – B and B – A labeled]
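The identity can be checked on small sets. The values in `a` and `b` below are made up for the illustration; Python's set operators stand in for the relational ones.

```python
# Hypothetical sets standing in for two union compatible relations.
a = {1, 2, 3, 4}
b = {3, 4, 5}

# Intersection built only from union and difference.
derived = (a | b) - (a - b) - (b - a)

print(sorted(derived), derived == (a & b))  # [3, 4] True
```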

62

• Just as division itself was a bit messy to explain, explaining why it isn’t a basic operation is also a bit messy.

• Let TableX, TableY, and TableZ again be given as a starting point for the discussion.

• Let TableC = the Cartesian product of TableX and TableZ.

63

• Let the difference TableC – TableY be considered.

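• All rows for the xid values that would be in the result of the division would be eliminated, because every pairing of those xid values with a zid value appears in TableY.

• Some rows for the remainder xid values would also be eliminated, namely the pairings that do appear in TableY.

• However, at least one row would be left for each remainder xid value, so the xid values that were in TableC – TableY would be the same as the xid values that were in the remainder.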

64

• The remainder values, by definition, are those that didn’t match with all of the zid values.

• That is why there will be remainder values left after the set subtraction

65

• Now do a projection on TableX on the xid column, giving a single column table, TableAllXid, containing all values of xid.

• Also do a projection on (TableC – TableY) on the xid column, giving a single column table, TableRemainders, containing all of the remainder values of xid.

• Then the result of the division would be TableAllXid – TableRemainders.

66

• In other words a pair of subtractions and projections can be used to obtain the set of values that a division would return.

• Division is not a basic operation because it can be accomplished by a combination of basic operations.
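The construction can be written directly in SQL using only product, projection, and difference. The sketch below assumes SQLite (through Python's sqlite3 module), where CROSS JOIN gives the product and EXCEPT gives the difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TableX (xid TEXT, xone TEXT);
CREATE TABLE TableY (xid TEXT, zid TEXT);
CREATE TABLE TableZ (zid TEXT, zone TEXT);
INSERT INTO TableX VALUES ('a','g'),('b','h'),('c','i'),('d','i');
INSERT INTO TableY VALUES ('a','r'),('a','s'),('a','t'),('b','r'),('b','s'),
                          ('c','r'),('d','r'),('d','s'),('d','t');
INSERT INTO TableZ VALUES ('r','l'),('s','m'),('t','n');
""")

# TableC - TableY: pairings in the product that are missing from TableY.
missing = sorted(conn.execute("""
SELECT TableX.xid AS xid, TableZ.zid AS zid FROM TableX CROSS JOIN TableZ
EXCEPT
SELECT xid, zid FROM TableY
"""))

# TableAllXid - TableRemainders: project, then subtract.
quotient = sorted(row[0] for row in conn.execute("""
SELECT xid FROM TableX
EXCEPT
SELECT xid FROM
  (SELECT TableX.xid AS xid, TableZ.zid AS zid FROM TableX CROSS JOIN TableZ
   EXCEPT
   SELECT xid, zid FROM TableY)
"""))

print(missing, quotient)
```

The missing pairings are (b, t), (c, s), and (c, t), so the remainder xid values are b and c, and subtracting them leaves a and d.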

67

Who Cares About the Primitive Operators?

• Some database management systems used relational algebra as their query language.

• The Quel language of Ingres is often given as an example, although Quel is more accurately described as being based on relational calculus.

• This has largely been supplanted by SQL.

• The point of the basic relational operators is that a system with a language that can accomplish what the five basic operators can accomplish is known as relationally complete.

68

• In other words, all data stored in the database is retrievable.

• All systems can be measured against this standard.

• Theoretically speaking, SQL is a bit of a syntactical mish-mash.

• Whether successful or not, the designers’ goal was to make it friendly to users, not necessarily theoretically beautiful.

69

• In any case, SQL is relationally complete.

• This is easily established by showing that it supports the five basic operations.

• The WHERE clause implements restriction (selection).

• The listing of the desired fields in a SELECT statement implements projection.

• A join without a joining condition implements the Cartesian product.

• SQL has a UNION operator, so it implements union.

70

• Finally, relational subtraction is implemented through NOT EXISTS.

• Let relations A and B be given.

• Let A and B be union compatible.

• In other words, they have the same set of attributes.

• For the sake of illustration, let the attributes simply be named 1, 2, …, n.

71

• Then what SQL query would find A – B?

SELECT *
FROM A
WHERE NOT EXISTS
   (SELECT *
    FROM B
    WHERE A.1 = B.1 AND A.2 = B.2 AND …
    AND A.n = B.n)

• In other words, find all of those records of A where there is no record in B that is exactly the same.

• Any record of A where there was a record in B that was exactly the same would be subtracted out.
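A concrete version of the subtraction query can be run for the case n = 2. The column names `c1` and `c2` and the rows are hypothetical stand-ins for attributes 1 and 2; SQLite, through Python's sqlite3 module, is assumed as the dbms. Its EXCEPT operator is used only as a cross-check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (c1 INTEGER, c2 TEXT);
CREATE TABLE B (c1 INTEGER, c2 TEXT);
INSERT INTO A VALUES (1,'p'),(2,'q'),(3,'r');
INSERT INTO B VALUES (2,'q'),(3,'x');
""")

# A - B through NOT EXISTS: keep rows of A that have no identical row in B.
not_exists = sorted(conn.execute("""
SELECT c1, c2 FROM A
WHERE NOT EXISTS
  (SELECT * FROM B WHERE A.c1 = B.c1 AND A.c2 = B.c2)
"""))

# Cross-check with the dedicated difference operator.
except_rows = sorted(conn.execute(
    "SELECT c1, c2 FROM A EXCEPT SELECT c1, c2 FROM B"))

print(not_exists, except_rows)
```

Row (3, 'r') survives because (3, 'x') in B differs in the second attribute. The two forms agree here because the sample rows contain no nulls and no duplicates.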

72

• As you know, SQL also supports joining with separate syntax.

• This is part of what makes SQL a mish-mash, but in this instance, it certainly helps make SQL more user friendly.

73

A Fully Relational Database

• Do not confuse the phrase “relationally complete”, just explained, with the phrase “fully relational”.

• As stated at the beginning, a relational dbms has three components:

• Structures: domains and relations

• Integrity rules: entity and referential

• A manipulation language: DDL, DML.

• For example, relational algebra, or something else which is relationally complete.

74

• The book notes that there are commercially available systems that advertise themselves as relational but which have certain limitations.

• For example, the systems may have an implementation of SQL but not support domains or integrity rules.

• The question is, is it fair to call these systems relational?

75

• The answer is that they are not fully relational.

• E. F. Codd was one of the people instrumental in developing relational databases.

• He came up with a list of 12 characteristics that could be included in a complete relational dbms implementation, and which such a system should have.

• These are the accepted measuring stick for whether a system is fully relational.

76

• When considering the current state of the dbms market it is worth noting that Codd’s rules were enunciated in 1985.

• When reading the rules, it may be helpful to read them “negatively.”

• In other words, for every rule there is or has been a dbms advertised as relational that did not have that characteristic.

77

Codd’s Rules for a Relational Database

• 1. The information rule

• Regardless of the underlying implementation, from the user’s point of view, there is only one logical representation of data in a database:

• Values stored in fields stored in tables.

• 2. The guaranteed access rule

• Every value in a database has to be accessible by specifying the table name, the column name, and the primary key value of the row in which it’s stored.

78

• 3. Systematic treatment of null values

• The system has to support the semantics of null.

• It can’t rely on devices such as storing blanks or 0’s or other default values to signify null.

• The system also has to support the syntax of null in the query language.
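The difference between a true null and a stand-in value such as 0 can be seen in a small sketch. The `Reading` table and its rows are hypothetical, and SQLite (through Python's sqlite3 module) is assumed as the dbms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Reading (id INTEGER PRIMARY KEY, amount INTEGER);
INSERT INTO Reading VALUES (1, 0), (2, NULL), (3, 5);
""")

# Null has its own syntax: IS NULL, not "= NULL" and not "= 0".
is_null = conn.execute("SELECT id FROM Reading WHERE amount IS NULL").fetchall()
is_zero = conn.execute("SELECT id FROM Reading WHERE amount = 0").fetchall()
eq_null = conn.execute("SELECT id FROM Reading WHERE amount = NULL").fetchall()

# Null has its own semantics: aggregates skip it rather than treat it as 0.
average = conn.execute("SELECT AVG(amount) FROM Reading").fetchone()[0]

print(is_null, is_zero, eq_null, average)  # [(2,)] [(1,)] [] 2.5
```

If the missing amount had been stored as 0, the average would have been computed over three values instead of two.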

79

• 4. Active online catalog of the relational model

• The system has to maintain an online catalog.

• This will include tables like SYSTABLE, SYSCOLUMN, SYSINDEX, etc.

• It should be possible for the user to query the catalog and find out all of the information about a given user database.

• Note that informally a data dictionary is at least a partial representation of the contents of the system catalog.
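Catalog names differ from one dbms to another. As one example, and as an assumption not taken from the book, SQLite exposes its catalog as the table `sqlite_master` plus the `PRAGMA table_info` command rather than SYSTABLE and SYSCOLUMN. The sketch below queries it through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TableX (xid TEXT, xone TEXT);
CREATE TABLE TableY (xid TEXT, zid TEXT);
""")

# The catalog is queried with ordinary SELECT statements.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]

# Column information for one table; the second field of each row is the name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(TableY)")]

print(tables, columns)  # ['TableX', 'TableY'] ['xid', 'zid']
```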

80

• 5. The comprehensive data sublanguage rule

• The system has to have a language or languages that support the following:

– Data definition

– Data manipulation

– Security and integrity

– Transaction processing

– Interactive querying and querying embedded in a programming language

• Even if a graphical user interface is provided, a text based language supporting these functions has to be provided

• Note that SQL meets all of these requirements

81

• 6. The view updating rule

• The dbms has to be able to update any view that is theoretically updatable.

• Comment mode on:

• Note that when views were covered, it was explained that a change to a view should cause a change in the underlying table(s).

• This rule tells you that some systems have not implemented views in this theoretically correct way.

82

• 7. High-level insert, update, and delete

• The system has to support set-at-a-time operations.

• In other words, it has to be possible to insert, update, and delete multiple records at a time.

• Comment mode on:

• Note that this is a swipe at graphical user interface-only systems.

• Without a real language, like SQL, it is unlikely that a graphically based system will be able to support multiple inserts, updates, and deletes.
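Set-at-a-time modification means that one statement can change many rows. The sketch below assumes SQLite through Python's sqlite3 module and reuses the example tables; the new value 'j' is made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TableX (xid TEXT, xone TEXT);
CREATE TABLE TableY (xid TEXT, zid TEXT);
INSERT INTO TableX VALUES ('a','g'),('b','h'),('c','i'),('d','i');
INSERT INTO TableY VALUES ('a','r'),('a','s'),('a','t'),('b','r'),('b','s'),
                          ('c','r'),('d','r'),('d','s'),('d','t');
""")

# One UPDATE statement changes every row that satisfies the condition.
updated = conn.execute(
    "UPDATE TableX SET xone = 'j' WHERE xone = 'i'").rowcount

# One DELETE statement removes a whole set of rows.
deleted = conn.execute(
    "DELETE FROM TableY WHERE xid IN ('b', 'c')").rowcount

print(updated, deleted)  # 2 3
```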

83

• 8. Physical data independence

• The logical appearance of tables and data to users will not change even if there is some change in their physical storage.

• For example, a database may be ported to a different machine, hard drive, etc.

• As long as the dbms is the same, the db should seem unchanged.

• This is also true for changes such as adding indexes.

84

• A user may notice a change in performance, but every query should still run, and it should not be necessary for the user to write queries with syntax that specifies that an index should be used when executing it.

• The system itself is responsible for all access issues at the physical level.
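The point about indexes can be illustrated by running the same query text before and after an index is created. This sketch assumes SQLite through Python's sqlite3 module; the index name `idx_tabley_zid` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TableY (xid TEXT, zid TEXT);
INSERT INTO TableY VALUES ('a','r'),('a','s'),('a','t'),('b','r'),('b','s'),
                          ('c','r'),('d','r'),('d','s'),('d','t');
""")

query = "SELECT xid FROM TableY WHERE zid = 's' ORDER BY xid"

before = conn.execute(query).fetchall()

# A physical-level change: the query text does not mention the index at all.
conn.execute("CREATE INDEX idx_tabley_zid ON TableY (zid)")

after = conn.execute(query).fetchall()

print(before == after, before)  # True [('a',), ('b',), ('d',)]
```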

85

• 9. Logical data independence

• Information preserving changes to the base tables should not affect queries or applications.

• For example, adding a new table to a db should in no way affect any pre-existing applications.

86

• 10. Integrity independence

• Integrity constraints should be part of the dbms’s function.

• Application programs should not have to contain the logic for maintaining the constraints.

• It should be possible to change the constraints in the system without affecting existing applications.

• Note that this should not be confused with data integrity, which is a user problem.
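As a sketch of a constraint enforced by the dbms rather than by application code, the example below declares a foreign key and lets the system reject a bad row. It assumes SQLite through Python's sqlite3 module; note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`, a detail specific to that product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: off by default
conn.executescript("""
CREATE TABLE TableX (xid TEXT PRIMARY KEY, xone TEXT);
CREATE TABLE TableY (xid TEXT REFERENCES TableX (xid), zid TEXT);
INSERT INTO TableX VALUES ('a','g');
INSERT INTO TableY VALUES ('a','r');
""")

# The application contains no checking logic; it simply attempts the insert.
try:
    conn.execute("INSERT INTO TableY VALUES ('e', 'r')")  # 'e' is not in TableX
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM TableY").fetchone()[0]
print(rejected, row_count)  # True 1
```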

87

• 11. Distribution independence

• If a dbms advertises itself as distributed, the distribution should be entirely transparent.

• In other words, all tables, data and applications should be accessible and work in the same way as they do without distribution, without any changes needed on the part of the user.

88

• 12. The non-subversion rule

• It should not be possible to get around the security or integrity constraints by using some other interface or access into the database.

89

• Rule 0

• At a later time Codd also stated this rule:

• The dbms should make it possible to manage a database entirely through its relational capacities.

• In other words, you may supply a graphical user interface or some user tools that are not explicitly relational, but you also have to provide the relational interface.

90

• By way of explanation, the author now introduces another phrase, “totally relational”.

• The idea is that the system won’t allow non-relational tools to subvert the database.

• It also has a complete set of relational tools to manage the database.

• If these two conditions are met, along with the other 12 (plus rule 0), the dbms is totally relational, even though it may also provide other kinds of interfaces for convenience.

91

Chapter 10, SQL

• Chapter 10 in the book reviews SQL syntax and then presents some additional information

• The syntax review will be ignored

• The additional information will be summarized

92

• SQL allows for the creation of a user defined function

• The syntax is CREATE FUNCTION…

• The specifics aren’t important

• The general idea is that the user can create a simple numerical/arithmetic function

93

• SQL allows for the creation of a user defined procedure

• The syntax is CREATE PROCEDURE…

• The specifics aren’t important

• The general idea is that the user can package together a sequence of SQL commands/operations/queries in order to support multi-part transactions

94

• SQL allows for the creation of a user defined trigger

• The syntax is CREATE TRIGGER…

• The specifics aren’t important

• The general idea is that the user can create a type of stored procedure which is automatically triggered when some action is taken on the database such as inserting, updating, or deleting the rows of a table

• Triggers can be used to enforce business rules, data integrity checking, transaction logging, etc.
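Trigger syntax varies between products. As one hedged illustration of transaction logging, the sketch below uses SQLite's form of CREATE TRIGGER through Python's sqlite3 module; the `Account` and `AuditLog` tables and the trigger name `log_balance` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Account (id INTEGER PRIMARY KEY, balance INTEGER);
CREATE TABLE AuditLog (account_id INTEGER, old_balance INTEGER, new_balance INTEGER);
INSERT INTO Account VALUES (1, 100);

-- Fires automatically after each update of the balance column.
CREATE TRIGGER log_balance AFTER UPDATE OF balance ON Account
BEGIN
    INSERT INTO AuditLog VALUES (OLD.id, OLD.balance, NEW.balance);
END;
""")

# The application only updates the account; the dbms writes the log row.
conn.execute("UPDATE Account SET balance = 70 WHERE id = 1")

audit = conn.execute("SELECT * FROM AuditLog").fetchall()
print(audit)  # [(1, 100, 70)]
```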

95

• SQL supports security by making it possible to grant or revoke the ability to take certain actions to individuals or groups of users

• This is the basic syntax:

• GRANT privilege(s) ON object(s) TO user(s) [WITH GRANT OPTION]

96

• These are the privileges that apply to base tables and views:

• SELECT, INSERT, UPDATE, DELETE

• These are the privileges that apply only to base tables:

• ALTER, INDEX

• It is also possible to specify the following:

• ALL PRIVILEGES

97

• Users can be lists of userid’s or potentially all users, PUBLIC

• The WITH GRANT OPTION tells whether or not a user who has been granted a privilege also has the right to grant it to another user

• Privileges can be withdrawn with REVOKE

• If REVOKE is issued on a user who granted a privilege to another user, the privilege is also revoked from this other user

98

• The catalog

• The system catalog was touched on briefly in the previous chapter

• The catalog is a db in its own right

• By querying tables like SYSCATALOG, SYSCOLUMNS, SYSINDEXES, etc., it is possible to find out everything there is to know about the databases recorded in the catalog

• Note: It is a mystery why SYSCOLUMNS is plural rather than singular in this discussion

99

• Natural Language Processing

• Some vendors may offer natural language processing as a feature of their dbms

• This would allow users to write queries in English

• The system would translate them to SQL

• This is problematic because of the possible ambiguities in English

• It is also possibly problematic because a user who doesn’t understand the database well enough to apply SQL to it may not be able to form clear, meaningful queries against the database in English

100

• Connectivity, ODBC, JDBC, etc.

• ODBC stands for open database connectivity

• This is a set of standards/technology with the following purpose:

• In a client server environment, a client can use a server database where the server dbms may be one of several different kinds

• This is accomplished by defining one standard interface and writing a driver for each kind of dbms which supports the common interface

101

• Embedded SQL

• This topic will be covered in greater detail at the end of the course when considering PHP

• SQL can be used as a stand-alone language for ad hoc queries

• Procedural programming languages also have syntax allowing for SQL statements to be embedded in them

• This allows a program to process the results of a query, for example

• It also allows a program to enter data into tables
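The course covers this topic with PHP. As a rough analogue only, the sketch below shows the same two uses from Python with its sqlite3 module: a host program entering data into a table and then processing the rows a query returns. The `labels` list is a hypothetical stand-in for whatever processing an application would do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableZ (zid TEXT, zone TEXT)")

# The program enters data into a table; "?" marks values supplied by the host.
new_rows = [("r", "l"), ("s", "m"), ("t", "n")]
conn.executemany("INSERT INTO TableZ (zid, zone) VALUES (?, ?)", new_rows)

# The program processes the results of a query one row at a time.
labels = []
for zid, zone in conn.execute("SELECT zid, zone FROM TableZ ORDER BY zid"):
    labels.append(f"{zid}:{zone}")

print(labels)  # ['r:l', 's:m', 't:n']
```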

102

• SQL standardization

• SQL was first standardized in the 1980’s

• For example, there was a standard known as SQL-89

• SQL-92, also known as SQL2, is the current gold standard

• In other words, most vendors support this standard, potentially with additional features

103

• SQL-99 added object-oriented features

• It is not clear yet whether vendors will follow this standard or go their own way

• It’s also not clear whether it’s an improvement to keep adding new features to a standard that has been relatively simple and successful

• SQLJ refers to another direction taken in SQL standardization, trying to integrate it with Java

104

Summary

• Purists may quibble about one or more features of SQL

• Also, SQL keeps on developing and it’s not clear all of the developments will succeed in the marketplace

• However, the core of SQL has been around for some time

• There is no sign that SQL is going to go away any sooner than relational database management systems are going to go away.

105

The End

106

• Why is there a remainder in relational division?

• What's left behind are those values that couldn't have been the result of a product in the first place, because they are not matched with all of the other values.

• In any event, relational division is certainly related to, and complementary to the operation of finding a product.
